Rinaldo B. Schinazi 


Probability 
with Statistical 
Applications 


Third Edition 


Birkhäuser


Rinaldo B. Schinazi 
Department of Mathematics 
University of Colorado 
Colorado Springs, CO, USA 


ISBN 978-3-030-93634-1 ISBN 978-3-030-93635-8 (eBook) 
https://doi.org/10.1007/978-3-030-93635-8 


Mathematics Subject Classification: 60-01, 60Axx 


© Springer Nature Switzerland AG 2001, 2012, 2022 

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of 
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, 
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information 
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology 
now known or hereafter developed. 

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication 
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant 
protective laws and regulations and therefore free for general use. 

The publisher, the authors and the editors are safe to assume that the advice and information in this book 
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or 
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any 
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional 
claims in published maps and institutional affiliations. 


This book is published under the imprint Birkhäuser, www.birkhauser-science.com, by the registered
company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


Preface to the Third Edition 


This edition involves a major reorganization of the book. I have split the material 
into many more chapters. Many chapters are now largely independent, and this 
allows for more flexible teaching. The separation between discrete and continuous 
distributions is sharper. 

The style has been changed as well. In particular, there are no longer boxed 
formulas. In my experience, students focus too much on the formula and try to 
apply it without noticing differences between problems on the same topic. What has 
not changed is my teaching philosophy. For each topic, I concentrate on a few critical
concepts, and I teach them by using many examples.

I have used the first 14 chapters for a first course in probability with statistical 
applications. This covers all the classical discrete probability distributions and a few 
continuous distributions (uniform, exponential, and normal). Confidence intervals 
and several statistical tests are also covered. 

I have used some of the material covered in Chaps. 15 through 24 for a second 
course in probability. Starting with Chap. 15, the mathematical level of the book 
increases. Calculus of one and several variables is assumed to be known. The level of 
abstraction is also higher. For instance, we use cumulative distribution functions and 
moment generating functions. We introduce the classical continuous distributions 
as transformations of basic distributions. We study joint continuous distributions, 
covariance, and correlation. We also provide several applications. 

The last part of the book is devoted to mathematical statistics. Starting in 
Chap. 25, we cover estimation (method of moments, maximum likelihood, and 
Bayes’ estimation) and comparison of estimators. I have used material from 
Chaps. 25 through 29 (supplemented by some previous chapters) for a course in 
mathematical statistics. 


Colorado Springs, CO, USA Rinaldo B. Schinazi 


Contents


1 Probability Space
   1 Equally Likely Outcomes
   2 The Axioms of Probability

2 Conditional Probabilities
   1 Definition
   2 Bayes' Method
   3 [illegible]
   4 Independence
   5 The Birthday Problem

3 Discrete Random Variables
   1 Discrete Distributions
      1.1 Bernoulli Random Variables
      1.2 Geometric Random Variables
   2 Expectation
      2.1 The Expectation of a Sum
   3 Variance
      3.1 Variance and Independence
   4 Coupon Collector's Problem

4 Binomial Random Variables
   1 Binomial Probability Distribution
   2 Mean and Variance
      2.1 Derivation of the Binomial Distribution
   3 Normal Approximation
      3.1 The Normal Table
      3.2 Normal Approximation
   4 The Negative Binomial

5 Poisson Random Variables
   1 Poisson Probability Distribution
   2 Poisson Scatter Theorem
   3 Poisson Approximation to the Binomial
   4 Approximation to a Sum of Binomials
   5 Mean and Variance

6 Simulations of Discrete Random Variables
   1 Random Numbers
   2 Bernoulli Random Variables
   3 Binomial Random Variables
      3.1 Computational Formula for the Binomial Distribution
   4 Poisson Random Variables

7 Combinatorics
   1 Counting Principle
   2 Properties of the Binomial Coefficients
   3 Hypergeometric Random Variables
   4 Mean and Variance of a Hypergeometric
   5 Conditioning on the Number of Successes

8 Continuous Random Variables
   1 Probability Densities
   2 Uniform Random Variables
   3 Exponential Random Variables
      3.1 Memoryless Property
   4 Expected Value
      4.1 Symmetric Probability Density
      4.2 Function of a Random Variable
   5 [illegible]
   6 Variance
   7 Normal Random Variables
      7.1 The Standard Normal
      7.2 Normal Random Variables
      7.3 Applications of Normal Random Variables
      7.4 Expectation and Variance of a Standard Normal

9 The Sample Average and Variance
   1 The Sample Average
   2 The Central Limit Theorem
   3 The Sample Variance
   4 Monte Carlo Integration

10 Estimating and Testing Proportions
   1 Testing a Proportion
   2 Confidence Interval for a Proportion
   3 Testing Two Proportions
   4 Confidence Interval for Two Proportions

11 Estimating and Testing Means
   1 Testing a Mean
   2 Confidence Interval for a Mean
   3 Testing Two Means
   4 Two Means Confidence Interval

12 Small Samples
   1 Student Tests
   2 Two Means Student Tests
   3 Student Tests for Matched Pairs
   4 The Sign Test

13 Chi-Squared Tests
   1 Testing Independence
   2 Goodness-of-Fit Test

14 Design of Experiments
   1 Double-Blind Design
   2 [illegible]

15 The Cumulative Distribution Function
   1 Definition and Examples
   2 Transformations of Random Variables
   3 Sample Maximum and Minimum
   4 [illegible]

16 Continuous Joint Distributions
   1 Joint and Marginal Densities
   2 Independence
   3 Transformations of Random Vectors
   4 Gamma and Beta Random Variables
      4.1 The Function Gamma
      4.2 Gamma Random Variables
      4.3 The Ratio of Two Gamma Random Variables
      4.4 Beta Random Variables

17 Covariance and Independence
   1 Covariance
   2 Independence
   3 Correlation
   4 Variance of a Sum
   5 Proof That the Expectation Is Linear
   6 Proof That the Correlation Is Bounded

18 Conditional Distribution and Expectation
   1 [illegible]
   2 [illegible]
   3 Conditional Expectation
      3.1 Conditional Expectation and Prediction

19 The Bivariate Normal Distribution
   1 [illegible]
   2 [illegible]
      2.1 Best Predictor
   3 The Joint Probability Density
   4 The Conditional Probability Density

20 Sums of Bernoulli Random Variables
   1 The Expected Number of Birthdays
   2 The Matching Problem
      2.1 Expected Number of Matches
      2.2 Variance of a Sum
      2.3 Variance of the Number of Matches
   3 The Moments of the Hypergeometric
   4 The Number of Records

21 Coupling Random Variables
   1 Coupling Two Bernoulli Random Variables
   2 Coupling Two Poisson Random Variables
   3 The Coupling Inequality
   4 Poisson Approximation of a Sum
      4.1 Poisson Approximation of a Binomial
   5 Proof of the Poisson Approximation

22 The Moment Generating Function
   1 Definition and Examples
      1.1 Sum of i.i.d. Bernoulli Random Variables
      1.2 Sum of Independent Poisson Random Variables
   2 The m.g.f. of a Normal
   3 Moment Computations
   4 Convergence in Distribution
      4.1 Binomial Convergence to a Poisson
      4.2 Proof of the Central Limit Theorem

23 Chi-Squared, Student, and F Distributions
   1 The m.g.f. of a Gamma Random Variable
      1.1 Sum of i.i.d. Exponential Random Variables
   2 The Chi-Squared Distribution
   3 The Student Distribution
   4 The F Distribution

24 Sampling from a Normal Distribution
   1 The Sample Average and Variance
   2 The Sample Average Is Normal
   3 The Sample Variance Distribution
   4 The Standardized Average

25 Finding Estimators
   1 The Method of Moments
   2 The Maximum Likelihood Method

26 Comparing Estimators
   1 The Mean Squared Error
   2 Biased and Unbiased Estimators
   3 Two Estimators for a Normal Variance
   4 Two Estimators for a Uniform Distribution
   5 Proof of the MSE [illegible]

27 Best Unbiased Estimators
   1 Exponential Families of Distributions
   2 Minimum Variance Unbiased Estimators
   3 Sufficient Statistics
   4 A Factorization Theorem
   5 Conditional Expectation and Sufficiency

28 Bayes' Estimators
   1 The Prior and Posterior Distributions
   2 Bayes' Estimators

29 Multiple Linear Regression
   1 The Least Squares Estimate
   2 [illegible]
      2.1 Sums of Squares
      2.2 [illegible]
      2.3 Significance of the Model
      2.4 Estimating the Variance
      2.5 Testing Individual Regression Coefficients
   3 [illegible]
      3.1 The Normal Equations
      3.2 Partitioning the Sum of Squares
      3.3 Expectation and Variance of a Random Vector
      3.4 Normal Random Vectors

List of Common Discrete Distributions
List of Common Continuous Distributions
Further Reading
Standard Normal Table
Student Table
Chi-Squared Table


Chapter 1
Probability Space


1 Equally Likely Outcomes 


The study of probability is concerned with the mathematical analysis of random 
experiments such as tossing a coin, rolling a die, or playing at the lottery. Each time 
we perform a random experiment, there are a number of possible outcomes. 


• The sample space Ω of a random experiment is the collection of all possible
outcomes of the random experiment.
• An event is a subset of Ω.


Example 1 Toss a coin once. There are only two possible outcomes, heads or tails.
The sample space is Ω = {H, T}. The event A = {H} is the event “the outcome
was heads.”


Example 2 Roll a die. This time the sample space is Ω = {1, 2, 3, 4, 5, 6}. The
event B = {1, 3, 5} can also be written as the event “the die showed an odd face.”


Example 3 The birthday of someone. The sample space is the set consisting of the 
365 days of a year. 


Example 4 We count the number of rolls until we get a 6. In this case
Ω = {1, 2, ...}.


That is, the sample space consists of all strictly positive integers. Note that this
sample space has infinitely many elements.


Example 5 Roll a fair die. Then Ω = {1, 2, 3, 4, 5, 6}. If all the outcomes are
equally likely, we define


P(i) = 1/6 for i = 1, ..., 6.
© Springer Nature Switzerland AG 2022


R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_1 




What is the interpretation of the statement P(1) = 1/6? If we roll the die many 
times, the frequency of observed 1’s (that is, the observed number of 1’s divided by 
the total number of rolls) should be close to 1/6. 
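The frequency interpretation can be explored with a short simulation. The following is only a sketch: the function name `frequency_of_face`, the seed, and the numbers of rolls are illustrative choices that do not come from the text, and Python's pseudo-random generator stands in for a physical die.

```python
import random


def frequency_of_face(face, n_rolls, seed=0):
    """Observed frequency of `face` in `n_rolls` simulated rolls of a fair die."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls


# The frequency of 1's should settle near 1/6 as the number of rolls grows.
for n in (100, 10_000, 200_000):
    print(n, frequency_of_face(1, n))
```

With few rolls the frequency can be far from 1/6; it is only for a large number of rolls that we expect it to be close.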

More generally, we have the following: 


• Consider a finite sample space Ω with equally likely outcomes. Let |A| be the
number of elements in A. Then we define the probability of event A as


P(A) = |A| / |Ω|.


Note that the definition above only makes sense for a finite Ω.

• Many problems such as in the examples below will involve several objects such
as dice or coins. In order to end up with a sample space with equally likely
outcomes we will think of the different dice or coins as having different colors.
In other words, the different objects will always be thought to be distinguishable.


Example 6 Toss two fair coins. We think of one coin being red and one being 
yellow. We first list the outcome for the red coin and then for the yellow one. We get 
four outcomes Ω = {HH, HT, TH, TT}. These four outcomes are equally likely.
Hence,


P(at least 1 head) = |{HH, HT, TH}| / |Ω| = 3/4.


Example 7 Roll two dice. What is the probability that the sum is 11? The most 
natural sample space is all the possible sums, that is all integers from 2 to 12. But 
these outcomes are not equally likely so it is not a good choice. Instead think of the 
two dice as being distinguishable and pick for Ω the collection of all ordered pairs,
Ω = {(1, 1), (1, 2), ..., (1, 6),
     (2, 1), (2, 2), ..., (2, 6),
     ...
     (6, 1), (6, 2), ..., (6, 6)}.


There are 36 equally likely outcomes in Ω. There are only 2 ordered pairs that
yield a sum of 11. They are (5, 6) and (6, 5). Hence,


P(sum is 11) = 2/36.
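The count above can be checked by listing the 36 ordered pairs. This is a minimal sketch; the names `omega`, `prob`, and `sum_counts` are our own and are not notation from the text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (first die, second die).
omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(A) = |A| / |Omega| for an event given as a condition on outcomes."""
    favorable = [w for w in omega if event(w)]
    return Fraction(len(favorable), len(omega))


p_sum_11 = prob(lambda w: w[0] + w[1] == 11)
print(p_sum_11)  # Fraction reduces 2/36 to 1/18

# Number of ordered pairs giving each sum: the sums are not equally likely.
sum_counts = Counter(a + b for a, b in omega)
print(sorted(sum_counts.items()))
```

The second print shows why the sums 2 through 12 make a poor sample space here: a sum of 7 comes from six ordered pairs, while a sum of 2 comes from only one.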




Example 8 Roll two dice. What is the probability that the sum of the two dice is 4
or more? It is quicker to compute the probability that the sum is 3 or less which is
the complement of the event we want. We get


P(sum is 3 or less) = |{(1, 1), (1, 2), (2, 1)}| / 36 = 3/36.


Therefore,


P(sum is 4 or more) = 1 − 3/36 = 33/36.
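As a check, we can compute the same probability directly and through the complement. This is only a sketch under the equally likely model for two distinguishable dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 ordered pairs

at_most_3 = [w for w in omega if sum(w) <= 3]   # the complement: 3 outcomes
at_least_4 = [w for w in omega if sum(w) >= 4]  # the event we want: 33 outcomes

p_direct = Fraction(len(at_least_4), len(omega))
p_via_complement = 1 - Fraction(len(at_most_3), len(omega))

print(p_direct, p_via_complement)  # both reduce to 11/12
```

Both routes give 33/36; the complement requires listing 3 outcomes instead of 33.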


2 The Axioms of Probability 


In this section we will be a little more formal and general. In particular, we will give 
rules of probability that apply even when outcomes are not equally likely. 
We start by defining some useful set operations: 


• Assume that A is an event (that is, a subset of the sample space Ω). Then the event
consisting of all the elements of Ω not belonging to A is called the complement
of A and is denoted by A^c.

• Assume that A and B are two events. Then the intersection of A and B is the
set of all the elements in Ω that are in both A and B. The intersection of A and
B is denoted by A ∩ B or by AB.

• Assume that A and B are two events. Then the union of A and B is the set of
all the elements that are in A or in B (they may be in both). The union of A and
B is denoted by A ∪ B.

• The empty set is denoted by ∅. Two events are said to be disjoint or mutually
exclusive if


AB = ∅.

More generally, the events A1, A2, ... are said to be mutually exclusive if

AiAj = ∅ for i ≠ j.


Example 9 Let A be the event that a student is female, B the event that a student 
takes French, and D the event that a student takes calculus. 

What is the event “a student is female and takes calculus”? We want both A and 
D, so the event is AD. 

What is the event “a student does not take calculus”? We want everything not in 
D, so the event is D^c.

What is the event “a student takes French or calculus”? We want everything in B
and everything in D, so the event is B ∪ D.
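The three set operations correspond directly to operations on Python sets. In the sketch below the roster of six students is entirely hypothetical and serves only to make the events concrete.

```python
# Hypothetical roster: each student is one outcome of the sample space.
omega = {"Ana", "Ben", "Chloe", "Dev", "Eva", "Finn"}
A = {"Ana", "Chloe", "Eva"}   # the student is female
B = {"Ben", "Chloe", "Finn"}  # the student takes French
D = {"Ana", "Ben", "Eva"}     # the student takes calculus

female_and_calculus = A & D   # intersection of A and D
not_calculus = omega - D      # complement of D within omega
french_or_calculus = B | D    # union of B and D

print(sorted(female_and_calculus))
print(sorted(not_calculus))
print(sorted(french_or_calculus))
```

Note that the complement is always taken relative to the sample space, which is why `omega` appears explicitly in `omega - D`.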




We now state the three axioms of probability. 


The Axioms of Probability 
(i) For any event A in Ω, 0 ≤ P(A) ≤ 1.
(ii) P(Ω) = 1.
(iii) For a finite or infinite sequence of disjoint events A1, A2, ...,


P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ...


We now list several consequences of these axioms: 


C1. If AB = ∅, then by (iii)
P(A ∪ B) = P(A) + P(B).


C2. P(A^c) = 1 − P(A).
We now prove C2. Note that


A ∪ A^c = Ω.

Hence,
P(A ∪ A^c) = 1.
Since AA^c = ∅, by C1
P(A ∪ A^c) = P(A) + P(A^c).
Hence, P(A^c) = 1 − P(A) and C2 is proved.
C3. P(∅) = 0.
Observe that Ω^c = ∅ and by C2
P(∅) = 1 − P(Ω) = 1 − 1 = 0.


This proves C3.
C4. For any events A and B,


P(A) = P(AB) + P(AB^c).


We now prove C4. Note that an element in A is either in B or not in B.
Hence,


A = AB ∪ AB^c.


Using that AB ∩ AB^c = ∅ (why?), we get by C1,
P(A) = P(AB) + P(AB^c).


This proves C4.
C5. For any events A and B,


P(A ∪ B) = P(A) + P(B) − P(AB).


Observe that an element in A ∪ B can be in B or not in B. But if the element
is in A ∪ B but not in B, it must be in AB^c (why?). Hence,


A ∪ B = AB^c ∪ B.
Since AB^c and B are disjoint, we have by C1,
P(A ∪ B) = P(AB^c) + P(B).
Using now C4, we get P(AB^c) = P(A) − P(AB). Therefore,
P(A ∪ B) = P(A) − P(AB) + P(B).


This proves C5. 
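The consequences C2, C4, and C5 can be checked numerically on a small sample space. The sketch below uses the 36 ordered pairs for two dice with equally likely outcomes; the particular events A and B are our own choice for illustration. In the code, `A | B` is Python's set union and `A & B` is set intersection.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # two distinguishable dice


def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(omega))


A = {w for w in omega if w[0] + w[1] == 11}  # the sum is 11
B = {w for w in omega if w[0] == 6}          # the first die shows 6
A_c = omega - A                              # complement of A
B_c = omega - B                              # complement of B

assert P(A_c) == 1 - P(A)                    # C2
assert P(A) == P(A & B) + P(A & B_c)         # C4
assert P(A | B) == P(A) + P(B) - P(A & B)    # C5

print(P(A | B))  # probability of the union of A and B
```

A check on one example is of course not a proof; it only confirms that the formulas agree with a direct count.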


Example 10 We pick at random a person in a certain population. Let A be the 
event that the person selected attends college. Let B be the event that the person 
selected speaks French. Assume that the proportions of persons attending college
and of persons speaking French in the population are 0.1 and 0.02, respectively.
That is, P(A) = 0.1 and P(B) = 0.02. Assume also that the proportion of people
who both attend college and speak French is 0.01. That is, P(AB) = 0.01.

What is the probability that a person picked at random does not attend college?

This is the event A^c. Hence,


P(A^c) = 1 − P(A) = 0.9.

What is the probability that a person picked at random speaks French or attends
college?
This is the event A ∪ B. Thus,

P(A ∪ B) = P(A) + P(B) − P(AB) = 0.1 + 0.02 − 0.01 = 0.11.


What is the probability that a person speaks French and does not attend college?
This is the event A^c B. According to C4, we have


P(A^c B) = P(B) − P(AB) = 0.02 − 0.01 = 0.01.


Problems 


1. Toss three fair coins. 


(a) What is the probability of having at least one head? 
(b) What is the probability of having exactly one head? 


2. Roll two fair dice. 


(a) What is the probability that they do not show the same face? 
(b) What is the probability that the sum is 7? 
(c) What is the probability that the largest face is 3 or more? 


3. A roulette wheel has 38 pockets: 18 are red, 18 are black, and 2 are green. I bet on red,
you bet on black.


(a) What is the probability that I win? 
(b) What is the probability that at least one of us wins? 
(c) What is the probability that at least one of us loses? 


4. Roll 3 dice. 


(a) What is the probability that you get three 6’s? 
(b) What is the probability that you get a triplet? 
(c) What is the probability that you get a pair? 


5. I buy many items at a grocery store. What is the probability that the bill is a
whole number?

6. Roll two dice. What is the probability of getting at least one 6? 

7. A die with 8 faces has one face for each of the numbers 1 through 4 and two
faces for each of the numbers 5 and 6.


(a) What is the probability of rolling a 1? 
(b) What is the probability of rolling a 6? 
(c) What is the probability of rolling a 4 or more? 


8. A die with 20 faces is numbered from 1 to 20 in such a way that opposite faces
sum to 21.


(a) What is the probability of rolling a 6?
(b) What is the probability of rolling a 6 or more?
(c) Rolling two such dice, what is the probability of getting a sum of 7?


9. Pick a card at random from a 52-card deck.


(a) What is the probability that the card is a heart or a spade? 
(b) What is the probability that the card is a queen or a heart? 


10. Let A be the event that a person attends college and B be the event that a
person speaks French. Using intersections, unions, or complements, describe the
following events.


(a) A person does not speak French.
(b) A person speaks French and does not attend college.
(c) A person is either in college or speaks French.
(d) A person is either in college or speaks French but not both.


11. Let A and B be events such that P(A) = 0.6, P(B) = 0.3, and P(AB) = 0.1.


(a) Find the probability that A or B occurs.
(b) Find the probability that exactly one of A or B occurs.
(c) Find the probability that at most one of the two events A and B occurs.
(d) Find the probability that neither A nor B occurs.


12. In a college it is estimated that 1/4 of the students drink, 1/8 of the students
smoke, and 1/10 smoke and drink. Picking a student at random:


(a) What is the probability that the student neither drinks nor smokes?
(b) What is the probability that the student smokes or drinks?


13. Assume that P(A) = 0.1 and P(AB) = 0.05.


(a) What is the probability that A occurs and B does not occur?
(b) What is the probability that A or B does not occur?


14. Roll three dice. What is the probability of getting at least one 6?

15. If A ⊂ B, show that P(A^c B) = P(B) − P(A).

16. Show that for any three events A, B, C we have


P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(AC) − P(BC) + P(ABC).


Can you guess what the formula is for the union of four events?


Chapter 2
Conditional Probabilities


1 Definition 


We start with an example. 


Example 1 Roll two dice successively and observe the sum. As we observed before,
we should take for our sample space the 36 ordered pairs. Let A be the event “the
sum is 11.” Since all the outcomes are equally likely, we have that


P(A) = |{(5, 6), (6, 5)}| / 36 = 2/36 = 1/18.

Let B be the event “the first die shows a 6.” We are now interested in the following 
question: if we observe the first die and it shows a 6, how does this affect the 
probability of observing a sum of 11? In other words, given B, what is the 
probability of A? The notation for the preceding probability is 


P(A|B) 


and is read “probability of A given B.” Given that the first die shows a 6, there is 
only one possibility for the sum to be 11. The second die needs to show 5. The 
probability of this event is 1/6. Thus, P(A|B) = 1/6. 

More generally, we have the following definition: 


• Assume that P(B) > 0. The conditional probability of A given B is defined by


P(A|B) = P(AB) / P(B).
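The definition can be applied mechanically to the two dice example. The following is a minimal sketch; the helper names `P` and `conditional` are ours, and the model is the 36 equally likely ordered pairs.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))  # two dice rolled successively


def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(omega))


def conditional(A, B):
    """P(A|B) = P(AB) / P(B); only defined when P(B) > 0."""
    if P(B) == 0:
        raise ValueError("P(B) must be strictly positive")
    return P(A & B) / P(B)


A = {w for w in omega if w[0] + w[1] == 11}  # the sum is 11
B = {w for w in omega if w[0] == 6}          # the first die shows a 6

print(P(A))               # 1/18
print(conditional(A, B))  # 1/6
```

Knowing that the first die shows a 6 raises the probability of a sum of 11 from 1/18 to 1/6, in agreement with the direct argument on the second die.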


© Springer Nature Switzerland AG 2022
R. B. Schinazi, Probability with Statistical Applications,
https://doi.org/10.1007/978-3-030-93635-8_2


Example 2 We pick at random a person in a certain population. Let A be the event
that the person selected attends college. Let B be the event that the person selected
speaks French. Assume that P(A) = 0.1, P(B) = 0.02, and P(AB) = 0.01. Given
that the person we picked speaks French, what is the probability that this person
attends college?

We want


P(A|B) = P(AB) / P(B) = 0.01/0.02 = 1/2.


Given that the selected person attends college, what is the probability that this person
speaks French? This time we want


P(B|A) = P(AB) / P(A) = 0.01/0.1 = 0.1.


Given that the selected person attends college, what is the probability that this person
does not speak French?


P(B^c|A) = 1 − P(B|A) = 1 − 0.1 = 0.9.


The previous two examples show how to compute conditional probabilities by 
using unconditional probabilities. In many situations the reverse is useful:
the conditional probabilities are easy to compute, and we use them to compute
unconditional probabilities. The following rule will be useful for that:


• The definition of conditional probability is equivalent to the following
multiplication rule,


P(AB) = P(A|B)P(B). 


Example 3 A factory has an old (O) and a new (N) machine. The new machine 
produces 70% of the products and 1% of these products are defective. The old
machine produces the remaining 30% of the products, and of those 5% are defective.
All products are randomly mixed. What is the probability that a product picked at 
random is defective and produced by the new machine? 

Let D be the event that the product picked at random is defective. Note that the 
following probabilities are given. 


P(N) = 0.7, P(O) = 0.3, P(D|N) = 0.01, and P(D|O) = 0.05.

We want the probability of DN. By the multiplication rule we have


P(DN) = P(D|N)P(N) = 0.01(0.7) = 0.007. 




Assume now that we are interested in the probability that a product picked at 
random is defective. We can write 


P(D) = P(DN) + P(DO). 


That is, a defective product may come from the new or the old machine. Now we 
use the multiplication rule twice to get 


P(D) = P(D|N)P(N) + P(D|O)P(O) = 0.01(0.7) + 0.05(0.3) = 0.022. 


That is, we get the overall defective proportion by taking the weighted average of 
the defective proportions. 
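
The computation in Example 3 can be written as a small Python sketch. It is an illustration only; the dictionary names `p_machine`, `p_defect_given`, and `p_defect_and` are hypothetical labels introduced here, and the numbers are those given in the example.

```python
# Share of production and defective rate for each machine (from Example 3).
p_machine = {"N": 0.7, "O": 0.3}
p_defect_given = {"N": 0.01, "O": 0.05}

# Multiplication rule: P(D and machine) = P(D | machine) * P(machine).
p_defect_and = {m: p_defect_given[m] * p_machine[m] for m in p_machine}

# A defective product comes from exactly one machine, so the pieces add up.
p_defect = sum(p_defect_and.values())

print(round(p_defect_and["N"], 4))  # 0.007
print(round(p_defect, 4))           # 0.022
```

The weights 0.7 and 0.3 sum to 1, which is why the result is a weighted average of the two defective rates.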

In the previous example we used the following rule that is the key to solving 
many problems: 


• Rule of averages. For any events A and B,

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c).

More generally, if the events B_1, B_2, ..., B_n are mutually exclusive and if their
union is the whole sample space Ω, then

P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + ... + P(A|B_n)P(B_n).


We now apply the rule of averages to another example.

Example 4 We have three boxes labeled 1, 2, and 3. Box 1 has 1 white ball and 2
black balls, Box 2 has 2 white balls and 1 black ball, and Box 3 has 3 white balls.
One of the three boxes is picked at random, and then a ball is picked from this box.
What is the probability that the ball picked is white?

Let W be the event “the ball picked is white.” We use the rule of averages and get

P(W) = P(W|1)P(1) + P(W|2)P(2) + P(W|3)P(3).

The conditional probabilities above are easy to compute. We have

P(W|1) = 1/3, P(W|2) = 2/3, P(W|3) = 1.

Thus,

P(W) = 1/3 x 1/3 + 2/3 x 1/3 + 1 x 1/3 = 2/3.


2 Bayes’ Method 


Example 5 We go back to Example 4. As we have just seen, the conditional 
probability P(W|1) is easy to compute. What about P(1|W)? That is, given that 
we picked a white ball, what is the probability that it came from box 1? 

In order to answer this question we start by using the definition of conditional 
probability 


P(1|W) = P(1W) / P(W).


Now we use the multiplication rule for the numerator and the rule of averages for the
denominator. We get

P(1|W) = P(W|1)P(1) / [P(W|1)P(1) + P(W|2)P(2) + P(W|3)P(3)].


Numerically, we have 


P(1|W) = (1/3 x 1/3) / (2/3) = 1/6.

Note that P(1|W) = 1/6 is half of P(1) = 1/3. That is, given that the ball drawn is
white, box 1 is less likely to have been picked than boxes 2 and 3. Since box 1 has
fewer white balls than the other boxes, this is not surprising.
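
The same steps give the conditional probability of each of the three boxes given a white ball. The following Python sketch is an illustration under the assumptions of Example 4 (each box picked with probability 1/3); the names `prior`, `p_white`, and `posterior` are introduced here and do not come from the text.

```python
from fractions import Fraction

prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}   # P(box)
p_white = {1: Fraction(1, 3), 2: Fraction(2, 3), 3: Fraction(1)}    # P(W | box)

# Rule of averages for the denominator P(W).
p_W = sum(p_white[b] * prior[b] for b in prior)

# Multiplication rule in the numerator, then divide: P(box | W).
posterior = {b: p_white[b] * prior[b] / p_W for b in prior}

print(p_W)            # 2/3
print(posterior[1])   # 1/6
print(posterior[2])   # 1/3
print(posterior[3])   # 1/2
```

The three conditional probabilities sum to 1, as they must, since exactly one box was picked.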

The method in Example 5 is named after Bayes. This method applies when
we want the conditional probability P(A|B) but what is readily available is the 
conditional probability P(B|A). 


Example 6 It is estimated that 10% of the population has a certain disease. A 
diagnostic test is available but is not perfect. There are two possible misdiagnoses. 
A healthy person may be misdiagnosed as sick with a probability of 5%. A person 
with the disease may be misdiagnosed as healthy with a probability of 1%. Given 
that a person picked at random is diagnosed with the disease, what is the probability 
that this person is actually sick? 

Let D be the event that the person actually has the disease and + be the event 
that the person is diagnosed as having the disease. We are asked to compute the 
conditional probability P(D|+). Note that P(+|D) = 1 - 0.01 = 0.99, but P(D|+)
is not as readily available, so we use Bayes’ method:

P(D|+) = P(D+) / P(+) = P(+|D)P(D) / [P(+|D)P(D) + P(+|D^c)P(D^c)].




We know that P(D) = 0.1, so P(D^c) = 0.9. As observed before, P(+|D) = 0.99
and P(+|D^c) = 0.05. Thus,

P(D|+) = (0.99 x 0.1) / (0.99 x 0.1 + 0.05 x 0.9) ≈ 0.69.

So given that the person has tested positive, the probability that this person actually
has the disease is only about 0.69.
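
To see how the answer in Example 6 depends on the inputs, Bayes' method can be wrapped in a small function. This is a sketch for illustration; the function name `posterior_positive` and its parameter names are hypothetical and not taken from the text.

```python
def posterior_positive(prevalence, false_positive, false_negative):
    """P(D | +) by Bayes' method for a test with two possible misdiagnoses."""
    sensitivity = 1 - false_negative                    # P(+ | D)
    # Rule of averages: P(+) = P(+|D)P(D) + P(+|D^c)P(D^c).
    p_positive = (sensitivity * prevalence
                  + false_positive * (1 - prevalence))
    return sensitivity * prevalence / p_positive

# Values from Example 6: 10% prevalence, 5% false positives, 1% false negatives.
p = posterior_positive(prevalence=0.1, false_positive=0.05, false_negative=0.01)
print(round(p, 4))  # 0.6875
```

With the same test but a prevalence of 1% instead of 10%, the function returns about 0.17: the rarer the disease, the less a positive result says.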


3 Symmetry 


It is sometimes possible to avoid lengthy computations by invoking symmetry in a 
problem. We give such an example next.


Example 7 You are dealt two cards from a deck of 52 cards. What is the
probability that the second card is black?

One way to answer the preceding question is to condition on whether the first
card is black. Let B and R be the events “the first card is black” and “the first card
is red,” respectively. Let A be the event “the second card is black.” We have

P(A) = P(AR) + P(AB)
     = P(A|R)P(R) + P(A|B)P(B)
     = (26/51)(1/2) + (25/51)(1/2)
     = 1/2.

Now we show how a symmetry argument yields this result. Since we start with
26 red and 26 black cards, by the symmetry between red and black cards, we claim
that


P(the second card is red) = P(the second card is black).

Since

P(the second card is red) + P(the second card is black) = 1,

we get that


P(the second card is black) = 1/2. 
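
The value 1/2 can also be confirmed by exhaustive counting over all ordered deals of two cards. The Python sketch below is an illustration only; representing the deck as a list of color labels (`deck`) is a simplification made here, since only colors matter for this question.

```python
from fractions import Fraction
from itertools import permutations

# Only the colors matter: 26 black ('B') and 26 red ('R') cards.
deck = ["B"] * 26 + ["R"] * 26

# All ordered deals of two distinct cards (by position), each equally likely.
deals = list(permutations(range(52), 2))
second_black = sum(1 for first, second in deals if deck[second] == "B")

p_second_black = Fraction(second_black, len(deals))
print(p_second_black)  # 1/2
```

The count works out because each of the 26 black cards can be second after any of the 51 other cards, giving 26 x 51 favorable deals out of 52 x 51.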




Problems 


1. A company has two factories A and B. Assume that factory A produces 80% of 
the products and B the remaining 20%. The proportions of defectives are 0.05
for A and 0.01 for B. 


(a) What is the probability that a product picked at random comes from A and 
is not defective? 
(b) What is the probability that a product picked at random is defective? 


2. Consider two boxes labeled 1 and 2. In box 1 there are 2 black balls and 3 
white balls. In box 2 there are 3 black balls and 2 white balls. We pick box 1
with probability 1/3 and box 2 with probability 2/3. Then we draw a ball from the
box we picked. 


(a) Given that we pick box 2, what is the probability of drawing a white ball? 

(b) Given that we draw a white ball, what is the probability that we picked box 
1? 

(c) What is the probability of picking a black ball? 


3. Consider an electronic circuit with components C1 and C2. The probability that 
C1 fails is 0.1. If C1 fails, the probability that C2 fails is 0.15. If C1 works, the 
probability that C2 fails is 0.05. 


(a) What is the probability that both components fail? 
(b) What is the probability that at least one component works? 
(c) What is the probability that C2 works? 


4. Suppose five cards are dealt from a deck of 52 cards. 


(a) What is the probability that the second card is a queen? 
(b) What is the probability that the fifth card is a heart? 


5. Two cards are dealt from a deck of 52 cards. Given that the first card is red, 
what is the probability that the second card is a heart? 

6. A factory tests all its products. The proportion of defective items is 0.01. The 
probability that the test will catch a defective product is 0.95. The test will also 
reject nondefective products with probability 0.01. 


(a) Given that a product passes the test, what is the probability that it is 
defective? 

(b) Given that the product does not pass the test, what is the probability that 
the product is defective? 


7. Two cards are successively selected from a 52-card deck. The two cards are
said to form a blackjack if one of the cards is an ace and the other is either a 
ten, a jack, a queen, or a king. What is the probability that the two cards form a 
blackjack? 




8. Two dice are rolled. Given that the sum is 9, what is the probability that at least 
one die showed 6? 

9. Assume that 1% of men and 0.01% of women are color blind. A color blind 
person is chosen at random. What is the probability that this person is a man? 

10. Hemophilia is a genetic disease that is caused by a recessive gene on the X 
chromosome. A woman is said to be a carrier of the disease if she has the 
hemophilia gene on one X chromosome and the healthy gene on the other X 
chromosome. A woman carrier has probability 1/2 of transmitting the disease 
to each son since a son will get an X chromosome from the mother and a Y 
chromosome from the father. Because of her family history a woman is thought 
to have a 50% chance of being a carrier before having children. Given that this 
woman has three healthy sons, what is the probability that she is a carrier? 

11. Toss two fair coins. What is the probability of getting two heads given that there 
is at least one head? 

12. A (not so diligent) student knows about 50% of the material. He takes a multiple 
choice test with 4 possible answers for each question. If he knows the material, 
he gets the answer right. If he does not know the material, he guesses at random. 
Given that he got an answer right, what is the probability that he knew the 
material for that question? 

13. Consider the student population in a college campus. Assume that 55% of the 
students are female. Assume that 20% of the male students drink and 10% of the
female students drink.


(a) Pick a female student at random. What is the probability that she does not
drink?

(b) Pick a student at random. What is the probability that the student does not
drink?

(c) Pick a student at random. What is the probability that this student is male
and drinks?


4 Independence 


Example 8 We have three boxes labeled 1, 2, and 3. Box 1 has 1 white ball and 2
black balls, Box 2 has 2 white balls and 1 black ball, and Box 3 has 3 white balls.
One of the three boxes is picked at random and then a ball is picked from this box.
Given that we draw a white ball, what is the probability that we have picked box 1?

We have already computed this conditional probability and found it to be 1/6.
On the other hand the (unconditional) probability of picking box 1 is 1/3. So the
information that the ball drawn is white changes the probability of picking box 1. In
this sense we say that the events A = { box 1 is picked } and B = { a white ball is
drawn } are not independent. This leads to the following definition:


• Two events A and B are said to be independent if

P(AB) = P(A)P(B).


We have the following consequence:

• Assume that P(B) > 0. The events A and B are independent if and only if

P(A|B) = P(A).


Example 9 Consider again the three boxes of Example 8, but this time we put the
same number of white balls in each box. For instance, assume that each box has 2
white balls and 1 black ball. Are the events A = { box 1 is picked } and B = { a
white ball is drawn } independent?

We compute P(B) by conditioning on the three possible boxes. Hence,

P(B) = P(B|1)P(1) + P(B|2)P(2) + P(B|3)P(3)
     = 2/3 x 1/3 + 2/3 x 1/3 + 2/3 x 1/3 = 2/3.

Clearly, P(A) = 1/3. By Bayes’ method, we get

P(A|B) = P(AB) / P(B) = P(B|A)P(A) / P(B) = (2/3 x 1/3) / (2/3) = 1/3.

Therefore, P(A|B) = P(A). That is, A and B are independent. This should
be intuitively clear: the fact that the ball drawn is white does not yield additional
information about which box was picked, since all boxes have the same proportion
of white balls.
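
The definition P(AB) = P(A)P(B) can be tested directly for the two box configurations. The Python sketch below is only an illustration; the function name `box1_white_independent` is hypothetical, and it assumes, as in the examples, that each of the three boxes is picked with probability 1/3.

```python
from fractions import Fraction

def box1_white_independent(p_white_given_box):
    """Check P(AB) == P(A)P(B) for A = box 1 picked, B = white ball drawn."""
    p_box = Fraction(1, 3)                             # each box equally likely
    p_B = sum(p * p_box for p in p_white_given_box)    # rule of averages
    p_AB = p_white_given_box[0] * p_box                # multiplication rule
    return p_AB == p_box * p_B

same = [Fraction(2, 3)] * 3                                # 2 white, 1 black in each box
different = [Fraction(1, 3), Fraction(2, 3), Fraction(1)]  # boxes with 1, 2, 3 white balls

print(box1_white_independent(same))       # True
print(box1_white_independent(different))  # False
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.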


Example 10 Assume that A and B are independent events such that P(A) = 0.1 
and P(B) = 0.3. What is the probability that A or B occurs? 
We want P(A ∪ B). Recall that

P(A ∪ B) = P(A) + P(B) - P(AB).

Since A and B are independent, P(AB) = P(A)P(B). Hence,

P(A ∪ B) = 0.1 + 0.3 - 0.1 x 0.3 = 0.37.


Example 11 Assume that A and B are independent. Can they also be disjoint?

If A and B are disjoint, then AB = ∅. Thus, P(AB) = 0. However, if A and B
are also independent, then

P(AB) = P(A)P(B) = 0.

Thus, P(A) = 0 or P(B) = 0. So if A and B are independent, they can be disjoint
only if at least one of these events has probability zero. In all other cases (i.e., P(A) >
0 and P(B) > 0), independent events cannot be disjoint.


Example 12 Consider an electric circuit with two components in series as below. 


Assume that each component fails independently of the other with probability 
0.01. What is the probability that the circuit fails? 

In order for the circuit to fail, at least one of the two components must fail.
Let A be the event that the left component fails and B be the event that the right
component fails. So the probability of failure is

P(A ∪ B) = P(A) + P(B) - P(AB) = 0.01 + 0.01 - 0.0001 = 0.0199.


Example 13 Consider a circuit with two components in parallel as below. 


Assume that components fail independently with probability 0.01. What is the 
probability that the circuit fails? 
The circuit fails if both components fail.

P(AB) = P(A)P(B) = 0.0001.

As expected, the reliability of a parallel circuit is superior to the reliability of a
series circuit. However, it is the independence assumption that greatly increases the
reliability. The independence assumption may or may not be realistic.
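
The two circuit computations can be compared side by side. The following Python sketch is an illustration under the independence assumption used in Examples 12 and 13; the function names `series_failure` and `parallel_failure` are introduced here.

```python
def series_failure(p_a, p_b):
    """Series circuit fails if at least one component fails: P(A ∪ B)."""
    # Independence gives P(AB) = P(A)P(B).
    return p_a + p_b - p_a * p_b

def parallel_failure(p_a, p_b):
    """Parallel circuit fails only if both components fail: P(AB)."""
    return p_a * p_b

print(round(series_failure(0.01, 0.01), 6))    # 0.0199
print(round(parallel_failure(0.01, 0.01), 6))  # 0.0001
```

With these numbers the series circuit is about 200 times more likely to fail than the parallel one, but only as long as the components really do fail independently.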




Problems 


14. Two cards are successively dealt from a deck of 52 cards. Let A be the event
“the first card is an ace” and B be the event “the second card is a spade.” Are
these two events independent?

15. Two cards are successively dealt from a deck of 52 cards. Let A be the event
“the first card is an ace” and B be the event “the second card is an ace.” Are
these two events independent?

16. Roll two dice. Let A be the event “there is at least one 6” and B the event “the
sum is 7.” Are these two events independent?

17. (a) Roll one die 4 times. What is the probability of rolling at least one 6?

(b) Roll two dice 24 times. What is the probability of rolling at least one double
6?

18. Two cards are dealt from a 52-card deck without replacement.


(a) What is the probability of getting a pair?
(b) What is the probability of getting two cards of the same suit?


19. Assume that A and B are independent events with P(A) = 0.2 and P(B) =
0.5.


(a) What is the probability that exactly one of the events A and B occurs?
(b) What is the probability that neither A nor B occurs?
(c) What is the probability that at least one of the events A or B occurs?


20. Assume that the proportion of male students who drink is 0.2. Assume that
60% of the students are male and 40% are female.


(a) Pick a student at random. What should the proportion of female drinkers
be in order for the events “the student is male” and “the student drinks” to be
independent?

(b) Does your answer in (a) depend on the proportion of male students?


21. Assume that A and B are independent.


(a) Show that A and B^c are independent.
(b) Show that A^c and B^c are independent.


22. Assume that 3 components are as below.

Assume that each component fails independently of the others with probability
p_i, for i = 1, 2, 3. Find the probability that the circuit fails as a function
of the p_i.


5 The Birthday Problem 


In this section we deal with probabilities involving several events. Our main tool is 
a generalization of the multiplication rule. We now derive it for three events A, B, 
and C. We start by using the multiplication rule for the two events AB and C. 


P(ABC) = P(C ∩ (AB)) = P(C|AB)P(AB).

By the same multiplication rule,

P(AB) = P(B|A)P(A).

Hence,

P(ABC) = P(C|AB)P(B|A)P(A) = P(A)P(B|A)P(C|AB).


The same computation can be done for an arbitrary number of events and yields the 
following: 


• Consider n events A_1, A_2, ..., A_n. The probability of the intersection of these n
events is

P(A_1 A_2 ... A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1 A_2) ... P(A_n|A_1 A_2 ... A_{n-1}).


We now apply this formula to several examples. 


Example 14 Deal 4 cards from a deck of 52 cards. What is the probability of getting
four aces?

Let A_1 be the event that the first card is an ace, let A_2 be the event that the second
card is an ace, and so on. We want to compute the probability of A_1 A_2 A_3 A_4. We
use the multiplication rule above to get

P(A_1 A_2 A_3 A_4) = P(A_1)P(A_2|A_1)P(A_3|A_1 A_2)P(A_4|A_1 A_2 A_3).

The probability of A_1 is 4/52. Given that the first card is an ace, the probability
that the second card is an ace is 3/51. Hence, P(A_2|A_1) = 3/51. Similarly,
P(A_3|A_1 A_2) = 2/50 and P(A_4|A_1 A_2 A_3) = 1/49. Thus,

P(A_1 A_2 A_3 A_4) = (4/52) x (3/51) x (2/50) x (1/49) = 24/6,497,400.

A pretty slim chance of getting four aces!
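
The product of conditional probabilities in Example 14 can be accumulated in a loop. The Python sketch below is an illustration only; the variable names `aces` and `cards` are chosen here.

```python
from fractions import Fraction

p = Fraction(1)
aces, cards = 4, 52
for k in range(4):
    # Given that the first k cards were aces, 4 - k aces remain among 52 - k cards.
    p *= Fraction(aces - k, cards - k)

print(p)  # 1/270725
```

The reduced fraction 1/270725 equals 24/6,497,400.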




Example 15 We now deal with the famous birthday problem. Assume that there 
are 50 students in a class. What is the probability that at least two students have the 
same birthday? 

It is easier to deal with the complement of this event. That is, we are going to 
compute the probability that all 50 students were born on different days. Assume 
that we are going through a list of the 50 birthdays in the class. Let B_2 be the event
that the second birthday on the list is different from the first. Let B_3 be the event that
the third birthday on the list is different from the first two. More generally, let B_i be
the event that the ith birthday on the list is different from the first i - 1 birthdays
for i = 2, 3, ..., 50. We want to compute the probability of B_2 B_3 ... B_50. By the
multiplication rule, we have

P(B_2 B_3 ... B_50) = P(B_2)P(B_3|B_2)P(B_4|B_2 B_3) ... P(B_50|B_2 B_3 ... B_49).

Ignoring leap years, we assume that there are 365 days in a year. We also assume that
all days are equally likely for birthdays. Note that P(B_2) = 364/365. Given that the
first two birthdays are distinct, the third birthday has only 363 choices in order to
be distinct from the first two. So P(B_3|B_2) = 363/365. The same reasoning shows
that P(B_4|B_2 B_3) = 362/365. By doing the same type of computation for every
term in the product above, we get

P(B_2 B_3 ... B_50) = 364/365 x 363/365 x 362/365 x ... x 316/365.


This product is about 0.03. Hence, the probability that at least two students in
a class of 50 have the same birthday is about 0.97!
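
The long product in Example 15 is tedious by hand but immediate on a computer. The following Python sketch is an illustration under the same assumptions as the example (365 equally likely days, no leap years); the function name `prob_all_distinct` is hypothetical.

```python
def prob_all_distinct(n, days=365):
    """P(n people all have different birthdays), assuming equally likely days."""
    p = 1.0
    for i in range(1, n):
        p *= (days - i) / days   # the (i+1)-th birthday avoids the first i
    return p

p50 = prob_all_distinct(50)
print(round(p50, 3))      # 0.03
print(round(1 - p50, 3))  # 0.97
```

Calling the same function with smaller classes shows that the probability of all distinct birthdays already drops below 1/2 for a class of 23.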


Example 16 Consider a class of 50 students. What is the probability that at least 
one of the students was born on December 25? 

This is yet another case where it is easier to look at the complement of the event. 
We look at the list of birthdays in the class. Let A_i be the event that the ith student
on the list was not born on December 25, for 1 ≤ i ≤ 50. It is reasonable to assume
that the A_i are independent: knowing whether or not a certain student was born on
December 25 does not give us additional information about the birthdays of other
students (unless there are twins in the class, and we assume that is not the case).
By the independence assumption we have

P(A_1 A_2 ... A_50) = P(A_1)P(A_2) ... P(A_50).

Note that each A_i has probability 364/365. Thus,

P(A_1 A_2 ... A_50) = (364/365)^50 ≈ 0.87.

That is, the probability that at least one student in a class of 50 was born on a certain
fixed day is about 0.13.




Example 17 How many students should we have in a class in order to have at least 
one birthday on December 25 with probability at least 0.5? 

Let n be the minimum number of students that satisfies the condition above. We
use the events A_i, for 1 ≤ i ≤ n, defined in Example 16. We want

P(A_1 A_2 ... A_n) ≤ 0.5.

By independence we have that

(364/365)^n ≤ 0.5.

We take logarithms on both sides of the inequality to get

n ln(364/365) ≤ ln(0.5).

Recall that ln x < 0 if 0 < x < 1, so dividing by ln(364/365) reverses the inequality. Thus,

n ≥ ln(0.5) / ln(364/365) ≈ 252.65.

Numerically, n needs to be at least 253.
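
The bound in Example 17 can be evaluated numerically. The Python sketch below is an illustration only; the function name `min_class_size` and its parameters `target` and `days` are introduced here and are not part of the text.

```python
import math

def min_class_size(target=0.5, days=365):
    """Smallest n with 1 - ((days - 1) / days) ** n >= target."""
    q = (days - 1) / days                      # P(a given student misses the day)
    # log(q) is negative, so the ratio of the two logarithms is positive.
    return math.ceil(math.log(1 - target) / math.log(q))

n = min_class_size()
print(n)  # 253
```

Checking the two neighboring values confirms the answer: (364/365)^252 is still slightly above 0.5, while (364/365)^253 is slightly below.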


Problems 


23. Assume that 3 friends are randomly assigned to 5 classes. What is the 
probability that they are all in distinct classes? 
24. Five cards are dealt without replacement from a 52-card deck.


(a) What is the probability that the 5 cards are all hearts? 
(b) What is the probability of a flush (all cards of the same suit)? 


25. Roll 5 fair dice. What is the probability that at least two dice show the same 
face? 

26. What is the probability of getting at least one 6 in 10 rolls of a fair die? 

27. Assume that the chance to win at the lottery with one ticket is 1/1,000,000. 
Assume that you buy one ticket per week. How many weeks should you play to 
have at least 0.5 probability of winning at least once? 

28. Three electric components are in parallel. Each component fails independently
of the others with probability p_i, i = 1, 2, 3. What is the probability that the
circuit fails?

29. Three electric components are in series. Each component fails independently
of the others with probability p_i, i = 1, 2, 3. What is the probability that the
circuit fails?


30. Roll a die 4 times.


(a) What is the probability of getting 4 times the same face?
(b) What is the probability of getting 3 times the same face?


31. The probability of winning a certain game is 1/N for some fixed N. Show that
you need to play the game approximately 2N/3 times in order for the probability
of winning at least once to be 0.5 or more. (Use that ln 2 is approximately 2/3 and that
ln(1 - 1/N) is approximately -1/N for N large.)

32. Take 4 persons at random. What is the probability that they were all born in
different months?

33. In an urn we have 3 red and 3 black balls. We pick a ball at random. If the ball
is red, we add a red ball; if the ball is black, we add a black ball. So now we
have 7 balls, either 4 red and 3 black or 3 red and 4 black. We keep adding a
new ball this way each time we pick a ball. What is the probability that we pick
three red balls in our first three picks?

34. In this problem we are going to derive an approximate formula for the birthday
problem. Let p_n be the probability that n people have n distinct birthdays. We
have shown that

p_n = (364/365) x (363/365) x ... x ((365 - n + 1)/365).


(a) Show that ln(p_n) = ln(1 - 1/365) + ln(1 - 2/365) + ... + ln(1 - (n - 1)/365).
(b) Use that ln(1 - x) is approximately -x for x near zero to show that
ln(p_n) is approximately

-(1/365 + 2/365 + ... + (n - 1)/365).

(c) Show that p_n is approximately

exp(-n(n - 1) / (2 x 365)).

(Use that 1 + 2 + 3 + ... + n = n(n + 1)/2.)

(d) Compute p_n for n = 5, 10, 20, 30, 40, 50 by using the exact formula and
the approximation.


35. I draw one card from a deck of 52 cards. Let A be the event “I draw a king” and
let B be the event “I draw a heart.” Are A and B independent?

36. I draw 5 cards without replacement from a deck of 52 cards.


(a) What is the probability that I get 4 kings?
(b) What is the probability that I get 4 of a kind?

37. (a) I roll a die until the first 6 appears. What is the probability that I need 6 or
more rolls?

(b) How many times should I roll the die so that I get at least one 6 with
probability at least 0.9?

38. I draw 5 cards without replacement from a deck of 52 cards.


(a) What is the probability that I get no spade?
(b) What is the probability that I get no black cards?


39. I draw cards without replacement from a deck until I get a spade.


(a) What is the probability that I need exactly 7 draws?
(b) Given that 6 or more draws are required, what is the probability that exactly
7 draws are required?


40. Box 1 contains 2 red balls and 3 black balls. Box 2 contains 6 red balls and b
black balls. We pick one of the two boxes at random and draw a ball from that
box. Find b so that the color of the ball is independent of which box is picked.

41. A student goes to class on a snowy day with probability 0.5 and on a nonsnowy
day with probability 0.8. Assume that 10% of the days in January are snowy.
Given that the student was in class on January 28, what is the probability that it
snowed that day?

42. One die is biased and the probability of a 6 is 1/2. The other die is fair. You pick
one die at random and roll it. Given that you got a 6, what is the probability that
you picked the biased die?

43. Consider a placement test for Calculus. Assume that 80% of the students pass
the placement test and that 70% of the students pass Calculus. Experience has
shown that given that a student has failed the placement test there is a 90%
probability that the student will fail Calculus. Pick a student at random. Let A
be the event “the student passes the placement test,” and let B be the event “the
student passes Calculus.”


(a) Show that

P(AB) = P(A) - P(AB^c).

(b) Use (a) to compute P(AB).
(c) Given that a student passed the placement test, what is the probability that
the student will pass Calculus?


44. Consider a slot machine with three wheels, each marked with twenty symbols.
On the central wheel, nine of 20 symbols are bells, and on the left and right
wheels, there is one bell. In order to win the jackpot one has to get three
bells. Assume that the three wheels spin independently and that every symbol
is equally likely.


(a) What is the probability of hitting the jackpot?
(b) What is the probability of getting exactly two bells?


45. Assume that A, B, and C are independent events with probabilities 1/10, 1/5,
and 1/2, respectively.


(a) Compute P(ABC).
(b) Compute P(A ∪ B ∪ C).
(c) What is the probability that exactly one of A, B, or C occurs?


46. A tosses one coin and B tosses two coins. The winner is the player who gets the
most heads. In case of an equal number of heads A wins.


(a) Compute the probability that B wins given that A gets 0 heads.

(b) Compute the probability that B wins given that A gets 1 head.

(c) Compute the probability that B wins.

(d) Change the game so that A tosses 2 coins and B tosses 3 coins. The winner
is still the player who gets the most heads. In case of an equal number of
heads A wins. Compute the probability that B wins in the new game.


47. A rolls one die and B rolls two dice. The winner is the player who gets the most
6’s. In case of an equal number of 6’s A wins. What is the probability that A
wins?


Chapter 3
Discrete Random Variables


1 Discrete Distributions 


• A discrete random variable takes values in a countable space, usually the positive
integers. The distribution of a random variable X is the sequence of probabilities
P(X = k) for all values k for which P(X = k) > 0. Moreover,

Σ_k P(X = k) = 1.


Example 1 Toss two fair coins. Let X be the number of heads. The possible values
of X are 0, 1, and 2. The distribution of X is given by the following table.


k        | 0   | 1   | 2
P(X = k) | 1/4 | 1/2 | 1/4


To compute P(X ≤ 1) we note that there are only two disjoint possibilities,
X = 0 and X = 1, and therefore,

P(X ≤ 1) = P(X = 0) + P(X = 1) = 1/4 + 1/2 = 3/4.


1.1 Bernoulli Random Variables


• A Bernoulli random variable X has only two possible values, 0 and 1. We denote
P(X = 1) = p and P(X = 0) = 1 - p, where p is a fixed real number in (0, 1).


© Springer Nature Switzerland AG 2022
R. B. Schinazi, Probability with Statistical Applications,
https://doi.org/10.1007/978-3-030-93635-8_3

These are the simplest possible random variables. Perform a random experiment 
with only two possible outcomes, success or failure. Set X = 1 if the experiment
is a success and X = 0 if the experiment is a failure. Then, X has a Bernoulli 
distribution. 


Example 2 Roll a fair die once. Let X = 1 if we roll a 6 and X = 0 otherwise.
Then X is a Bernoulli random variable with p = 1/6.


Example 3 Toss a fair coin once. Let X = 1 if we get heads and X = 0 if we get
tails. Then X is a Bernoulli random variable with p = 1/2. 


1.2 Geometric Random Variables


Example 4 Roll a fair die until you get a 6. Let X be the number of rolls to get the
first 6. The possible values of X are all strictly positive integers. Note that X = 1
if and only if the first roll is a 6. So P(X = 1) = 1/6. In order to have X = 2 the
first roll must be anything but 6 and the second one must be 6. By independence of
the different rolls we get P(X = 2) = 5/6 x 1/6. More generally, in order to have
X = k the first k - 1 rolls cannot yield any 6 and the kth roll must be a 6. Thus,

P(X = k) = (5/6)^{k-1} x 1/6 for all k ≥ 1.


Example 4 is a typical example of a geometric distribution. Here is the general 
definition. 


• Consider a sequence of independent identical trials. Assume that each trial can
result in a success or a failure. Each trial has a probability p of success and
q = 1 - p of failure. Let X be the number of trials up to and including the first
success. Then X is called a geometric random variable. The distribution of X is
given by

P(X = k) = q^{k-1} p for all k ≥ 1.

Note that if X is a geometric random variable, then P(X = k) > 0 for all integers
k ≥ 1. These probabilities get closer and closer to 0 as k → +∞ but are never 0.
Recall the following geometric series from Calculus.

Σ_{k ≥ 0} x^k = 1 + x + x^2 + ... + x^k + ... = 1/(1 - x)

for all x ∈ (-1, 1).


Assume that X is a geometric random variable with parameter p. Then,

Σ_{k ≥ 1} P(X = k) = Σ_{k ≥ 1} q^{k-1} p = p(1 + q + q^2 + ...) = p x 1/(1 - q) = 1.


This shows that the geometric distribution is indeed a probability distribution. 


Example 5 Toss a fair coin until you get tails. What is the probability that exactly 
three tosses were necessary? 
In this example we have p = q = 1/2. So

P(X = 3) = q^2 p = 1/8.

What is the probability that three or more tosses were necessary?
The long way to answer this question is to do

P(X ≥ 3) = Σ_{k ≥ 3} P(X = k)

and then use the geometric series to compute the sum. A better way is to notice
that X ≥ 3 if and only if the first two tosses are heads (why?). The latter event has
probability q^2. Thus,

P(X ≥ 3) = q^2 = 1/4.


Next we generalize the last observation. 


• Let X be a geometric random variable with parameter p. Then,

P(X > r) = q^r.

Note that if X > r, then each one of the first r experiments is a failure.
Conversely, if each one of the first r experiments is a failure, then X > r. Using
the fact that experiments are independent of each other, the probability that each
one of the first r experiments is a failure is q^r. This proves the formula.
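
The tail formula can be compared numerically with the distribution itself. The Python sketch below is an illustration only; the function names `geometric_pmf` and `geometric_tail` are hypothetical, and the values p = 1/6 and r = 4 are arbitrary choices made here.

```python
def geometric_pmf(k, p):
    """P(X = k) = q**(k - 1) * p for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

def geometric_tail(r, p):
    """P(X > r) = q**r."""
    return (1 - p) ** r

p, r = 1 / 6, 4
# Compare the closed form with 1 minus the first r terms of the distribution.
direct = 1 - sum(geometric_pmf(k, p) for k in range(1, r + 1))
print(abs(direct - geometric_tail(r, p)) < 1e-12)  # True
```

For a fair coin, `geometric_tail(2, 0.5)` returns 0.25, the probability that three or more tosses are needed.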


Example 6 Two players roll a die. If the die shows a 6, then A wins, and if the die 
shows a 1 or a 2, then B wins. The die is rolled until A or B wins. What is the 
probability that A wins? 


Note that A can win in 1 roll or in 2 rolls or in 3 rolls and so on. These possibilities
are disjoint (why?). Hence,

P(A wins) = Σ_{n ≥ 1} P(A wins in n rolls).

The event “A wins in n rolls” is the same as the event “the first n - 1 rolls are draws
and the nth roll is a 6.” Note that the probability that a roll results in a draw is 3/6.
Thus,

P(A wins in n rolls) = (3/6)^{n-1} x 1/6.

Summing the geometric series, we get

P(A wins) = Σ_{n ≥ 1} (3/6)^{n-1} x 1/6 = (1/6) x 1/(1 - 3/6) = 1/3.

There is an intuitive way to check this result by noting that

(1/6) / (1/6 + 2/6) = 1/3,

where 1/6 is the probability of A winning in 1 roll and 2/6 is the probability of B
winning in 1 roll.
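
The sum of the series in Example 6 can be checked against the intuitive ratio by adding up the first few terms. The Python sketch below is an illustration only; the names `p_A`, `p_B`, `p_draw`, and the cutoff of 40 terms are choices made here.

```python
from fractions import Fraction

p_A, p_B = Fraction(1, 6), Fraction(2, 6)   # A wins on a 6, B wins on a 1 or a 2
p_draw = 1 - p_A - p_B                      # 3/6: nobody wins on this roll

# Partial sum of P(A wins in n rolls) = p_draw**(n - 1) * p_A for n = 1, ..., 40.
partial = sum(p_draw ** (n - 1) * p_A for n in range(1, 41))

closed_form = p_A / (p_A + p_B)
print(closed_form)                           # 1/3
print(float(closed_form - partial) < 1e-12)  # True
```

The partial sums increase toward 1/3, and after 40 terms the remaining gap is (1/3) x (1/2)^40.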


Problems 


1. Toss three fair coins. Let X be the number of heads. 


(a) Find the distribution of X. 
(b) Compute P(X > 2). 


2. Roll two dice. Let X be the sum of the faces. Find the distribution of X. 
3. There are 38 pockets in a roulette, 18 are red, 18 are black, and 2 are green. I 
bet on red until I win. Let X be the number of bets I make. 


(a) What is the probability that X is 2 or more? 
(b) What is the probability that X is exactly 2? 


4. I roll four dice. I win if I get at least one 6. What is the probability of winning? 

5. Roll two fair dice. Let X be the largest of the two faces. What is the distribution 
of X? 

6. I draw 2 cards without replacement from a deck of 52 cards. Let X be the 
number of aces I draw. Find the distribution of X. 




7. How many times should I toss a fair coin in order to get tails at least once with 
probability 90%? 

8. In a lottery there are 100 tickets numbered from 1 to 100. Let X be the ticket
drawn at random. What is the distribution of X? 

9. I roll a die until I get a 6.


(a) What is the probability that the first 6 occurs between rolls 3 and 5? 
(b) Given that the first two rolls were not 6’s, what is the probability I need 5 
rolls or more in order to get a 6? 


10. A and B roll a die. A wins if the die shows a 6 and B wins if the die shows a 1.
The die is rolled until someone wins. 


(a) What is the probability that A wins? 
(b) What is the probability that B wins? 
(c) Let T be the number of times the die is rolled. Find the distribution of T. 


11. Let X be a geometric random variable. Show that
\[ P(X > r + s \mid X > r) = P(X > s). \]
12. Let X be a discrete random variable.
(a) Show that
\[ P(X = k) = P(X > k - 1) - P(X > k). \]

(b) Assume that for all k ≥ 1 we have $P(X > k) = q^k$. Use (a) to show that
X is a geometric random variable.


2 Expectation 


• Let X be a discrete random variable. The expectation (or expected value)
of X is defined by

\[ E(X) = \sum_k k P(X = k), \]


where the sum is over all values k such that P(X = k) > 0. 


Note that if a random variable takes infinitely many values then the computation 
of its expectation involves an infinite series. The expectation will be finite only if 
the infinite series converges. In fact there are many examples of random variables 
with infinite expectation, see the problems. 




Example 7 Roll a fair die. Let X be the face shown. We have
P(X =k) =1/6 


for every k = 1,2,..., 6. Thus, 


\[ E(X) = \sum_k k P(X = k) = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + \cdots + 6 \times \frac{1}{6} = \frac{7}{2}. \]


It may seem odd that the expected roll is 7/2. After all each face is a whole 
number! However, the Law of Large Numbers states that the average over many 
rolls approaches 7/2. This gives a meaning to the expectation and also explains why 
this is a crucial concept. 


• Let X be a Bernoulli random variable with parameter p. That is, P(X = 1) = p
and P(X = 0) = 1 — p. Then, E(X) = p. 


By definition of the expectation, 


\[ E(X) = 0 \times P(X = 0) + 1 \times P(X = 1) = p. \]


• Let X be a geometric random variable with parameter p. That is, $P(X = k) = q^{k-1} p$ for k = 1, 2, .... Then,

\[ E(X) = \frac{1}{p}. \]


Recall that a geometric random variable counts the number of independent trials until the first
success. The success probability of a single trial is p. Hence, if p is small (i.e.,
success is unlikely), the number of trials to achieve success should be high. This is 
consistent with the formula for the expectation. We now compute this formula. 


\[ E(X) = \sum_{k=1}^{\infty} k q^{k-1} p = p \sum_{k=1}^{\infty} k q^{k-1}. \]


Recall that 


\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x} \quad \text{for } x \in (-1, 1). \]




Thus, by taking derivatives on both sides of the preceding equality, we get 


\[ \sum_{k=1}^{\infty} k x^{k-1} = \frac{1}{(1 - x)^2} \quad \text{for } x \in (-1, 1). \]


We plug x = q and get for the expected value 


\[ E(X) = p \sum_{k=1}^{\infty} k q^{k-1} = p \frac{1}{(1 - q)^2} = \frac{1}{p}. \]


Example 8 Roll a die until you get a 6. What is the expected number of rolls?
Let T be the number of rolls to get a 6. This is a geometric random variable with 
p = 1/6. Thus, E(T) = 6. 
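
As a numerical illustration of the formula E(X) = 1/p, the series defining the expectation can be truncated and summed. This is only a sketch: the function name and the number of terms are assumptions chosen for illustration, and the truncation is harmless here because the terms decay geometrically.

```python
def geometric_mean_partial(p: float, n_terms: int) -> float:
    """Partial sum of k * q^(k-1) * p for k = 1..n_terms, with q = 1 - p."""
    q = 1.0 - p
    return sum(k * q ** (k - 1) * p for k in range(1, n_terms + 1))


# Example 8: rolling a die until a 6 appears corresponds to p = 1/6.
expected_rolls = geometric_mean_partial(1 / 6, 2000)
```

With p = 1/6 the partial sum agrees with 1/p = 6 up to rounding error.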


2.1 The Expectation of a Sum 


The following properties are critical in many computations: 


• For any random variables X and Y (discrete or continuous)
E(X + Y) = E(X) + E(Y).


• For any real number a we have E(aX) = aE(X).


In words, the expectation is a linear operator. The proof of this fact involves joint 
distributions and will be done in the relevant chapter. 


Example 9 Roll two dice. What is the expected value of the sum S?

The long way to answer this question would be to compute the distribution of S 
and then take the expected value of this sum. Instead we write S = X + Y where X 
and Y are the faces of each die. Using Example 7, 


\[ E(S) = E(X + Y) = E(X) + E(Y) = 7. \]


• Let a be a real number. Then, E(a) = a.


We can think of the constant a as a random variable with only one value.
Therefore, this value is taken with probability 1. Hence, E(a) = a × 1 = a.


Problems


13. Let X be a random variable such that P(X = 0) = 1/5 and P(X = 4) = 4/5.
Find E(X).

14. (a) Toss two fair coins. What is the expected number of heads?
(b) Toss five fair coins. What is the expected number of heads?

15. The probability of finding someone in favor of a certain initiative is 0.01. We
interview people at random until we find a person in favor of the initiative. What
is the expected number of interviews?

16. Roll four dice. What is the expected value of the sum?

17. There are three components in a circuit. During a given experiment each one can
fail with probability p. What is the expected number of working components
after the experiment?

18. We roll a die. You pay me b $ if the die shows 5 or 6. I pay you 1 $ otherwise.
Clearly, the probabilities of winning are not the same for both players. Let W
be my winnings.


(a) Show that
\[ E(W) = b \times 1/3 + (-1) \times 2/3. \]
(b) Explain why to make this a fair game we want E(W) = 0.
(c) Find b so that this is a fair game.


19. I roll 4 dice. If there is at least one 6, you pay me 1 $. If there are no 6's, I pay
you 1 $.


(a) What are my expected winnings?
(b) Is this a fair game?
(c) How would you make it into a fair game?


20. In this problem we give an example of a discrete random variable for which the
expectation is not defined.


(a) Use the fact that
\[ \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} \]
to find c so that $P(X = k) = c/k^2$ is a probability distribution.
(b) Show that the expectation of the random variable defined above does not
exist.


3 Variance 


The expectation is a measure of location. The expectation can be thought of as a
"typical" outcome of the corresponding probability distribution. But how good is
the expectation at summarizing the distribution? Next we define the variance, which
is a measure of how informative the expectation is. A small variance means low
dispersion of the distribution around the expected value. A large variance means 
high dispersion around the expected value. In the latter case the expected value is 
not very informative about the distribution. 


• Let X be a random variable with mean E(X) = μ. The variance of X is denoted
by Var(X) and is defined by

\[ Var(X) = E[(X - \mu)^2]. \]
It can be shown that Var(X) can be expressed as
\[ Var(X) = E(X^2) - \mu^2. \]


We will mostly use the latter formula. The standard deviation of X is denoted by
SD(X) and is defined by

\[ SD(X) = \sqrt{Var(X)}. \]

In order to compute Var(X) we will need to compute $E(X^2)$. For a discrete
random variable X it can be shown that (see the problems)

\[ E(X^2) = \sum_k k^2 P(X = k). \]


• Let X be a Bernoulli random variable with parameter p. Then,
\[ Var(X) = p(1 - p). \]
Recall that
\[ E(X) = p. \]
By the formula above
\[ E(X^2) = 0^2 \times (1 - p) + 1^2 \times p = p. \]
Thus,
\[ Var(X) = E(X^2) - E(X)^2 = p - p^2 = p(1 - p). \]


Example 10 Roll a fair die. Let X be the face shown. What is the variance of X?
Using that P(X = k) = 1/6 for every k = 1, 2, ..., 6,

\[ E(X^2) = 1^2 \times 1/6 + 2^2 \times 1/6 + \cdots + 6^2 \times 1/6 = \frac{91}{6}. \]

Since E(X) = 7/2 (see Example 7),

\[ Var(X) = E(X^2) - E(X)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12}. \]
So the standard deviation is approximately 1.7. This is large for a distribution on 
{1,..., 6}. Since all probabilities are equal in this example, we expect a large 
dispersion around the expected value and therefore a large standard deviation. 
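
The recipe of Example 10 (compute E(X), then E(X²), then subtract) applies to any discrete distribution with finitely many values. Here is a small sketch with exact fractions; the helper name `mean_and_variance` and the dictionary representation of a distribution are assumptions made for illustration.

```python
from fractions import Fraction


def mean_and_variance(dist):
    """dist maps each value k to P(X = k). Returns (E(X), Var(X))."""
    mean = sum(k * pk for k, pk in dist.items())
    second_moment = sum(k * k * pk for k, pk in dist.items())
    # Var(X) = E(X^2) - E(X)^2
    return mean, second_moment - mean ** 2


# Fair die: P(X = k) = 1/6 for k = 1, ..., 6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
die_mean, die_var = mean_and_variance(die)
die_sd = float(die_var) ** 0.5
```

For the fair die this reproduces the mean 7/2, the variance 35/12 and a standard deviation of about 1.7.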


• Let T be a geometric random variable with parameter p. Then,

\[ Var(T) = \frac{q}{p^2}. \]

Note that as p → 0, Var(T) → +∞ and that SD(T) is of the same order of
magnitude as E(T) = 1/p.

We now prove the formula. We will need the following property. 

For any function g ≥ 0,

\[ E(g(T)) = \sum_{k \ge 1} g(k) P(T = k). \tag{1} \]

As always we need to compute $E(T^2)$. For the geometric distribution it is easier
to compute E(T(T − 1)) first. We need a new fact about geometric series. Recall
that for every x in (−1, 1) we have

\[ \sum_{k=0}^{\infty} x^k = \frac{1}{1 - x}. \]

Power series are infinitely differentiable on their interval of convergence. We take
derivatives twice in the formula above to get

\[ \sum_{k=2}^{\infty} k(k - 1) x^{k-2} = \frac{2}{(1 - x)^3}. \tag{2} \]




Using (1) and letting q = 1 − p,

\[ E(T(T - 1)) = \sum_{k=1}^{\infty} k(k - 1) P(T = k) = \sum_{k=2}^{\infty} k(k - 1) q^{k-1} p = pq \sum_{k=2}^{\infty} k(k - 1) q^{k-2}. \]

We let x = q in (2) to get

\[ E(T(T - 1)) = pq \frac{2}{(1 - q)^3} = \frac{2q}{p^2}. \]

By the linearity of the expectation,

\[ E(T(T - 1)) = E(T^2 - T) = E(T^2) - E(T). \]

Hence,

\[ E(T^2) = E(T(T - 1)) + E(T) = \frac{2q}{p^2} + \frac{1}{p}. \]

Finally,

\[ Var(T) = E(T^2) - E(T)^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2}. \]

Note that p + q = 1, so 2q + p − 1 = q. Hence,

\[ Var(T) = \frac{q}{p^2}. \]
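
The formula for Var(T) can be checked numerically by truncating the series for E(T) and E(T²). The sketch below is an illustration only: the function name and the default number of terms are assumptions, and the truncation error is negligible because the terms decay geometrically.

```python
def geometric_moments(p: float, n_terms: int = 5000):
    """Approximate (E(T), Var(T)) for a geometric T by truncated sums."""
    q = 1.0 - p
    mean = 0.0
    second_moment = 0.0
    for k in range(1, n_terms + 1):
        pk = q ** (k - 1) * p        # P(T = k)
        mean += k * pk
        second_moment += k * k * pk
    return mean, second_moment - mean ** 2


# For p = 1/6 the formulas give E(T) = 1/p = 6 and Var(T) = q/p^2 = 30.
mean_t, var_t = geometric_moments(1 / 6)
```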


3.1 Variance and Independence 


• Two discrete random variables X and Y are said to be independent if
\[ P(\{X = x\} \cap \{Y = y\}) = P(X = x)P(Y = y) \quad \text{for all } x, y. \]


We now examine two examples. 




Example 11 Roll two dice. Let X be the face shown by the first die and let S be the 
sum of the two dice. Are X and S independent? 

Intuitively it is clear that the answer should be no. It is enough to find one x and 
one s such that 


\[ P(\{X = x\} \cap \{S = s\}) \ne P(X = x)P(S = s) \]

in order to show that X and S are not independent. For instance, take x = 1 and
s = 12. Clearly, if one of the two dice shows 1, the sum of the two dice cannot be
12. So P({X = 1} ∩ {S = 12}) = 0. However, P(X = 1) and P(S = 12) are
strictly positive so P({X = 1} ∩ {S = 12}) ≠ P(X = 1)P(S = 12). Thus, X and
S are not independent. 
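
For small sample spaces the product rule can be checked by brute force over all outcomes. The following sketch enumerates the 36 equally likely outcomes of two dice; the helper names `prob` and `are_independent` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))


def prob(event) -> Fraction:
    """Probability of an event given as a true/false function of an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def are_independent(f, g) -> bool:
    """Check P(f = x, g = y) = P(f = x) P(g = y) for all values x and y."""
    for x in {f(o) for o in outcomes}:
        for y in {g(o) for o in outcomes}:
            joint = prob(lambda o: f(o) == x and g(o) == y)
            if joint != prob(lambda o: f(o) == x) * prob(lambda o: g(o) == y):
                return False
    return True


first_and_sum = are_independent(lambda o: o[0], lambda o: o[0] + o[1])
first_and_second = are_independent(lambda o: o[0], lambda o: o[1])
```

The first check fails, as in Example 11, while the faces of the two dice pass the check.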


Example 12 Toss two fair coins. Set X = 1 if the first coin shows heads, and set 
X = 0 otherwise. Set Y = 1 if the second coin shows heads, and set Y = 0
otherwise. Are X and Y independent? 

We are going to show that X and Y are independent. We need to show that 


\[ P(\{X = x\} \cap \{Y = y\}) = P(X = x)P(Y = y) \]
for all 4 possibilities for (x, y). Our sample space is
\[ \Omega = \{(H, H), (H, T), (T, H), (T, T)\}. \]

We need to examine the 4 possible outcomes for (X, Y). Note that the event {X =
0} ∩ {Y = 0} is the event {(T, T)} and that has probability 1/4. Note that P(X =
0) = 2/4 = P(Y = 0). So the product rule holds for x = 0 and y = 0. We now
examine x = 0 and y = 1. The event {X = 0} ∩ {Y = 1} is the event {(T, H)}.
This has probability 1/4. Since P(X = 0) = P(Y = 1) = 2/4, the product rule holds in this case
as well. The two remaining cases are treated in a similar way, and we may conclude
that X and Y are independent.
Next we turn to the variance of the sum of two independent random variables.


• Assume that X and Y are two INDEPENDENT random variables. Then,
\[ Var(X + Y) = Var(X) + Var(Y). \]
The property above involves the joint distribution of (X, Y). It will be proved in
the appropriate chapter.


Example 13 Roll 2 dice. Let S be the sum of the two dice. What is the variance of 
S? 

Let X and Y be the faces shown by each die. From Example 10 we know that 
Var(X) = Var(Y) = 35/12. Since X and Y are independent, 


Var(S) = Var(X + Y) = Var(X) + Var(Y) = 2 x 35/12 = 35/6. 




Two other useful properties of the variance are the following: 


• Let a be a real number. Then,
\[ Var(aX) = a^2 Var(X), \]
and
\[ Var(X - a) = Var(X). \]


The last two properties are direct consequences of the definition of variance. 
Their proofs are left to the reader. 
As a consequence of the preceding properties, we have: 


• Let X have expectation μ and variance σ²; then

\[ E\left(\frac{X - \mu}{\sigma}\right) = 0 \quad \text{and} \quad Var\left(\frac{X - \mu}{\sigma}\right) = 1. \]


We now prove these properties. 
By the linearity of the expected value, 


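\[ E\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma} E(X - \mu), \]
and
\[ E(X - \mu) = E(X) - E(\mu) = \mu - \mu = 0. \]
Hence,
\[ E\left(\frac{X - \mu}{\sigma}\right) = 0. \]

We now compute the variance. By the quadratic property of the variance,

\[ Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X - \mu). \]

Since the variance is shift invariant, Var(X − μ) = Var(X). Hence,

\[ Var\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} Var(X) = \frac{1}{\sigma^2} \sigma^2 = 1. \]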


Problems


21. Let X be a random variable such that P(X = 0) = 1/5 and P(X = 4) = 4/5.
Find the variance of X.

22. The probability of finding someone in favor of a certain initiative is 0.01. We
interview people at random until we find a person in favor of the initiative. What
is the standard deviation of the number of interviews?

23. (a) Graph the histograms of the distribution of one die and of the distribution
for the maximum of two dice (rolled independently).
(b) Based on the histograms, which distribution should have the larger expectation?
Which distribution should have the larger variance?
(c) Compute the variance of the distribution of the maximum of two dice.

24. Toss three coins. Let X be the number of heads. Find:


(a) E(X).
(b) Var(X).


25. Roll a fair die with n faces. Let X be the number that appears.


(a) What is the distribution of X?
(b) Show that
\[ E(X) = \frac{n + 1}{2}. \]
(Use that $1 + 2 + \cdots + n = \frac{n(n+1)}{2}$.)
(c) Show that
\[ Var(X) = \frac{(n + 1)(n - 1)}{12}. \]
(Use that $1^2 + 2^2 + \cdots + n^2 = \frac{n(n+1)(2n+1)}{6}$.)


26. Let X have variance 2. What is the variance of −3X + 1?

27. Roll two dice successively. Let X be the face of the first die and Y the face of
the second die.


(a) Find Var(X − Y).
(b) Find Var(|X − Y|).


28. Let a be a real number and X be a random variable. Show that:


(a) Var(X − a) = Var(X).
(b) Var(aX) = a²Var(X).


29. In this problem we prove the formula

\[ E(X^2) = \sum_k k^2 P(X = k). \]

Assume that the support of X (i.e., all k such that P(X = k) > 0) is all strictly
positive integers. Let Y = X².


(a) What is the support S of Y?
(b) Show that
\[ E(Y) = \sum_{\ell \in S} \ell P(Y = \ell). \]
(c) Let k² = ℓ in the sum in (b). Show that
\[ E(Y) = \sum_{k=1}^{\infty} k^2 P(X = k). \]


30. Three people toss one fair coin each. The winner is the one whose coin shows
a face different from the two others. If the three coins show the same face, then
there is a new round of tosses, until someone wins.


(a) What is the probability of exactly one round of tosses?
(b) What is the probability that at least three rounds of tosses are necessary?


31. A and B take turns rolling a die. A starts. The winner is the first one that rolls a
6. What is the probability that A wins?

32. Two people play the following game. They toss two fair coins. If the two coins
land on heads, then A wins. If one coin lands on heads and the other on tails,
then B wins. If the two coins land on tails, then the coins are tossed again until
someone wins. What is the probability that B wins?

33. The probability of finding someone in favor of a certain initiative is 0.01. We
interview people at random until we find a person in favor of the initiative. What
is the probability that we need to conduct 50 or more interviews?

34. Draw five cards from a 52-card deck without replacement.


(a) What is the expected number of red cards among the five cards that have
been drawn?
(b) What is the expected number of hearts in 5 cards dealt from a deck of 52
cards?


35. Roll two dice. I win 1 $ if the sum is 7 or more. I lose b $ if the sum is 6 or less.
Find b so that this is a fair game.



36. Toss 5 fair coins. 


(a) What is the expected number of heads? 
(b) What is the variance of the number of heads? 


37. Assume that the random variable T is such that E(T) = 1 and E(T(T − 1)) = 2.
What is the standard deviation of T? 


4 Coupon Collector’s Problem 


Assume that a certain brand of cereals has a cartoon character in each box. There 
are r different cartoon characters. What is the expected number of cereal boxes that 
need to be purchased in order to get all the cartoon characters? 

Let $T_1$ be the number of boxes needed to get the first character. Obviously, $T_1 = 1$.
Let $T_2$ be the number of boxes needed to get the second (different) character.
Since we already have one character every time we buy a box, there is a probability
$1/r$ of getting the same character we already have and a probability $(r-1)/r$ to get a
different one. Hence, $T_2$ is a geometric random variable with success probability
$(r-1)/r$. More generally, let $T_k$ be the number of boxes needed to get the kth different
character given that we already have k − 1 different characters. Since we
already have k − 1 characters every time we purchase a box, the probability to get a
kth different character is $(r-k+1)/r$. That is, $T_k$ is a geometric random variable with
probability of success $(r-k+1)/r$ for k = 2, ..., r. The number of boxes needed to have
a complete collection is therefore

\[ T_1 + T_2 + \cdots + T_r. \]

Recall that the expected value of a geometric random variable with success
probability p is 1/p. Hence, the expected number of boxes needed to have the
complete collection is

\[ E(T_1 + T_2 + \cdots + T_r) = 1 + \frac{r}{r-1} + \frac{r}{r-2} + \cdots + \frac{r}{2} + \frac{r}{1}. \]

It is convenient to rewrite the formula as

\[ E(T_1 + T_2 + \cdots + T_r) = r\left(1 + \frac{1}{2} + \cdots + \frac{1}{r}\right). \]

As r goes to infinity one can show that

\[ 1 + \frac{1}{2} + \cdots + \frac{1}{r} \sim \ln r \]

in the sense that the ratio goes to 1. Hence, the expected number of boxes needed to
complete the collection is approximately r ln r when r is large.
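
The expected number of boxes is easy to evaluate exactly for a given r, and the quality of the r ln r approximation can be inspected directly. The sketch below is an illustration under the assumptions of this section (each box contains one of r equally likely characters); the function names are hypothetical.

```python
import math
from fractions import Fraction


def expected_boxes(r: int) -> Fraction:
    """Sum of the expectations r / (r - k + 1) of the geometric T_k, k = 1..r."""
    return sum(Fraction(r, r - k + 1) for k in range(1, r + 1))


def ratio_to_r_log_r(r: int) -> float:
    """Ratio of the exact expectation to the approximation r ln r."""
    return float(expected_boxes(r)) / (r * math.log(r))


# With r = 4 characters: 1 + 4/3 + 4/2 + 4/1 = 25/3 boxes on average.
boxes_for_4 = expected_boxes(4)
```

The ratio tends to 1 slowly: for r = 1000 it is still noticeably above 1, so r ln r should be read as an order of magnitude rather than a sharp estimate.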


Problems 


38. I roll a die repeatedly.


(a) What is the expected number of rolls to get three different faces? 
(b) What is the expected number of rolls to get all six faces? 


39. In an urn there are 100 balls numbered from 1 to 100. I draw at random one ball
at a time and then I put it back in the urn. 


(a) What is the expected number of draws to get 10 different numbers? 
(b) What is the expected number of draws to get all of the 100 numbers? 


Chapter 4
Binomial Random Variables


1 Binomial Probability Distribution 


• Let n ≥ 1 be an integer and p a real number in (0, 1). We perform n identical
and independent trials. Each trial has a success probability equal to p. The total
number of successes in n trials is a binomial random variable with parameters n
and p.

• Let B be a binomial random variable with parameters n and p. Then, the
distribution of B is given by

\[ P(B = k) = \binom{n}{k} p^k (1 - p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n, \]

where the binomial coefficient $\binom{n}{k}$ is read as "n choose k" and is defined by

\[ \binom{n}{k} = \frac{n!}{k!(n - k)!}. \]


Recall that 

0! = 1
1! = 1
2! = 2 × 1 = 2
3! = 3 × 2 × 1 = 6
4! = 4 × 3 × 2 × 1 = 24

and so on. 

© Springer Nature Switzerland AG 2022


R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_4 




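Note that for every n ≥ 0,

\[ \binom{n}{0} = 1 \quad \text{and} \quad \binom{n}{n} = 1. \]

Hence, for the particular cases k = 0 and k = n, the general formula simplifies
to

\[ P(B = 0) = (1 - p)^n \quad \text{and} \quad P(B = n) = p^n. \]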


Example 1 Roll a fair die 5 times. What is the probability of getting exactly two
6’s? 

In this case we are doing n = 5 identical rolls. The rolls are independent of each 
other. That is, knowing the first roll does not give any information about the second 
roll and so on. The probability of success is p = 1/6. Let B be the number of 6’s (or 
successes) in 5 trials. Thus, B is a binomial random variable with parameters n = 5 
and p = 1/6. Hence, 


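\[ P(B = 2) = \binom{5}{2} (1/6)^2 (5/6)^3. \]

We compute

\[ \binom{5}{2} = \frac{5!}{2!\,3!} = 10. \]

Hence,
\[ P(B = 2) = 10 \times \frac{5^3}{6^5} \approx 0.16. \]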


Example 2 What is the probability of getting at least one 6 in five rolls?
We want the probability of {B ≥ 1}. It is quicker to compute the probability of
the complement of {B ≥ 1}, which is {B = 0}.

\[ P(B = 0) = (1 - p)^n = (5/6)^5 \approx 0.4. \]


Thus, the probability of getting at least one 6 in 5 rolls is approximately 0.6. 


Example 3 Assume that births of boys and girls are equally likely. What is the 
probability that a family with three children has three girls? 

This time we have n = 3 trials and each has a probability of success (having a 
girl) equal to p = 1/2. Thus, 


\[ P(B = 3) = (1/2)^3 = 1/8. \]




Example 4 Consider four families, each with three children. What is the probability 
that exactly one family has three girls? 

We have n = 4 trials and a trial is a success if the corresponding family has 
exactly three girls. According to Example 3 the probability of success is 1/8. Hence, 
the number of families with exactly 3 girls is a binomial with n = 4 and p = 1/8. 
Thus, 


\[ P(B = 1) = \binom{4}{1} (1/8)^1 (7/8)^3 \approx 0.33. \]
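
Examples 1 to 4 all evaluate the same formula with different n, p and k. A minimal sketch of that formula follows; the function name `binomial_pmf` is an assumption for illustration, while `math.comb` is the binomial coefficient from the Python standard library.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(B = k) for a binomial B with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


two_sixes = binomial_pmf(2, 5, 1 / 6)               # Example 1
at_least_one_six = 1 - binomial_pmf(0, 5, 1 / 6)    # Example 2
three_girls = binomial_pmf(3, 3, 1 / 2)             # Example 3
one_family = binomial_pmf(1, 4, three_girls)        # Example 4
```

Note how Example 4 reuses the result of Example 3 as its success probability.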


A very convenient way of computing the binomial coefficients is the so-called 
Pascal’s triangle. Here are the first few rows of the triangle. 


n \ k |  0 |  1 |  2 |  3 |  4 |  5
  0   |  1 |    |    |    |    |
  1   |  1 |  1 |    |    |    |
  2   |  1 |  2 |  1 |    |    |
  3   |  1 |  3 |  3 |  1 |    |
  4   |  1 |  4 |  6 |  4 |  1 |
  5   |  1 |  5 | 10 | 10 |  5 |  1


We read $\binom{n}{k}$ at the intersection of row n and column k. For instance, $\binom{4}{2} = 6$. The
triangle is filled by using the following property:

\[ \binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}. \]

For instance, $\binom{5}{2} = 10$ is obtained by adding the number immediately above and the
number above and to the left.
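
The filling rule translates directly into a short program. This is a sketch for illustration only; the function name `pascal_triangle` is hypothetical, and each row is built from the previous one using the property above.

```python
def pascal_triangle(n_max: int):
    """Rows 0..n_max of Pascal's triangle; rows[n][k] is n choose k."""
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        # Interior entries: number above-left plus number immediately above.
        interior = [prev[k - 1] + prev[k] for k in range(1, n)]
        rows.append([1] + interior + [1])
    return rows


triangle = pascal_triangle(5)
```

Here `triangle[4][2]` and `triangle[5][2]` are the two entries quoted in the text.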


Problems 


1. Toss a fair coin 4 times. 


(a) What is the probability of getting exactly 2 heads? 
(b) What is the probability of getting at least 1 head? 
(c) What is the probability of getting at least 3 heads? 


2. Toss a fair coin 7 times. Let B be the number of heads. 


(a) Compute the distribution of B. 
(b) Draw the histogram of the distribution of B. 




(c) What is the expected value of B? 
(d) Find k for which P(B = k) has a maximum value. This k (there may be 
two k’s actually) is called the mode of the distribution. 


3. Items are examined sequentially at a manufacturing plant. The probability that 
an item is defective is 0.05. What is the probability that exactly two of the first 
20 items examined are defective? 
4. Roll two fair dice 5 times. 
What is the probability of getting exactly one sum equal to 7? 
5. Roll two fair dice 7 times. 
What is the probability of getting exactly two sums larger than or equal to 7? 
6. Roll a pair of dice 10 times. 


(a) What is the probability of getting at least once a pair of 6’s? 
(b) What is the probability of getting twice a pair of 6’s? 


7. Assuming that boys and girls are equally likely, how many children should 
a couple plan to have in order to have at least one boy and one girl with 
probability 0.99? 

8. Assume that 10% of the population are left-handers. What is the probability 
that in a class of 40 there are at least 3 left-handers? 

9. Given that there were 5 heads in 12 tosses of a fair coin. 


(a) What is the probability that the first toss was head? 
(b) What is the probability that the last two tosses were heads? 
(c) What is the probability that at least two of the first five tosses were heads? 


10. Show that for all 0 < k < n,
\[ \binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}. \]
2 Mean and Variance 


We start with the following important property: 

• A binomial random variable B with parameters n and p can be written as a sum of
n independent identically distributed Bernoulli random variables with parameter
p.

We define the Bernoulli random variables in the following way. If the first trial
is a success, let $X_1 = 1$; otherwise let $X_1 = 0$. If the second trial is a success, let
$X_2 = 1$; otherwise let $X_2 = 0$. We do this for all n trials. Then,

\[ B = X_1 + X_2 + \cdots + X_n. \]




This is so because there are as many successes as there are 1's in the r.h.s. Note also
that the $X_i$ are independent identically distributed (i.i.d. in short) Bernoulli random
variables with parameter p.


• Assume that B is a binomial random variable with parameters n and p. Then,
E(B) = np
Var(B) = np(1 − p).


We now prove these formulas. Using that B is a sum of Bernoulli random
variables,

\[ B = X_1 + X_2 + \cdots + X_n. \]
By the linearity of the expectation,
\[ E(B) = E(X_1) + E(X_2) + \cdots + E(X_n). \]
Since a Bernoulli random variable with parameter p has an expectation equal to p,
\[ E(B) = np. \]
Using now that the Bernoulli random variables are independent,
\[ Var(B) = Var(X_1) + \cdots + Var(X_n). \]

Recall that a Bernoulli random variable with parameter p has a variance equal to
p(1 − p); hence

\[ Var(B) = np(1 - p). \]


Example 5 Roll a die 30 times. What is the expected number of 5's?
The number of 5’s is a binomial random variable with parameters n = 30 and 
p = 1/6. So the expected number of 5’s is np = 5. 


Example 6 We roll two dice 10 times. What is the expected number of double 6’s? 
What is the variance of this distribution? 

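The probability of rolling a double 6 is 1/36. Hence, the number B of double 6's
in 10 rolls is a binomial random variable with n = 10 and p = 1/36, and its
expected value is E(B) = np = 10/36.

The variance of this distribution is

\[ np(1 - p) = 10 \times \frac{1}{36} \times \frac{35}{36} = \frac{350}{1296} \approx 0.27. \]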
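
The formulas E(B) = np and Var(B) = np(1 − p) can be confirmed by computing the mean and variance directly from the binomial distribution. The sketch below uses exact fractions; the function name `binomial_mean_var` is an assumption for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_mean_var(n: int, p: Fraction):
    """Mean and variance computed from P(B = k), k = 0..n, without shortcuts."""
    q = 1 - p
    pmf = [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    second_moment = sum(k * k * pk for k, pk in enumerate(pmf))
    return mean, second_moment - mean ** 2


# Example 6: n = 10 rolls of two dice, p = 1/36 for a double 6.
mean_b, var_b = binomial_mean_var(10, Fraction(1, 36))
```

Both values agree exactly with np and np(1 − p) for the parameters of Example 6.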


2.1 Derivation of the Binomial Distribution 


Using that $B = X_1 + X_2 + \cdots + X_n$ we now derive the distribution of a binomial
random variable with parameters n and p.

Let k be an integer between 0 and n. One of the ways B may be equal to k is if
the first k Bernoulli random variables are successes and the last n − k are failures.
Thanks to the independence of the Bernoulli random variables, this happens with
probability $p^k (1 - p)^{n-k}$. However, there are as many ways for B = k as there are
ways to distribute k 1's and n − k 0's among n places. It turns out that there are
exactly $\binom{n}{k}$ ways to do so (this will be proved in the Combinatorics chapter). All
these different possibilities of having k successes are disjoint and they all have the
same probability $p^k (1 - p)^{n-k}$. Hence, for any 0 ≤ k ≤ n

\[ P(B = k) = \binom{n}{k} p^k (1 - p)^{n-k}. \]


3 Normal Approximation 


As n gets large, the computation of P(B > a) may involve the computation of
many binomial probabilities. The most important technique for getting around this problem is
the normal approximation. We start by showing how to use the normal table.


3.1 The Normal Table 


A standard normal variable is a continuous random variable that will be studied 
in more detail when we introduce continuous random variables in a later chapter.
Our present goal is to learn to use a normal table in order to be able to find an 
approximation to binomial probabilities. 

Throughout the book we will use the notation Z to designate a standard normal 
random variable. A normal table is provided in the appendix for the probabilities 
P(0 < Z < z) for values of z in [0, 2.99]. Graphically, the normal table gives the
shaded area for different values of z. See Fig. 4.1 below. 


Fig. 4.1 The normal table 
gives the value of the shaded 
area between 0 and z 




We copy below the first 5 rows of the normal table. 


 z  | 0.00   | 0.01   | 0.02   | 0.03   | 0.04   | 0.05   | 0.06   | 0.07   | 0.08   | 0.09
0.0 | 0.0000 | 0.0040 | 0.0080 | 0.0120 | 0.0160 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359
0.1 | 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753
0.2 | 0.0793 | 0.0832 | 0.0871 | 0.0910 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141
0.3 | 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.1480 | 0.1517
0.4 | 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.1700 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879


To find for instance P(0 < Z < 0.32) we look at the intersection of the row
starting with 0.3 and the column starting with 0.02. We read

P(0 < Z < 0.32) = 0.1255.

To find P(0 < Z < 0.45), we look for the intersection of the row starting with 0.4
and the column starting with 0.05, and we read P(0 < Z < 0.45) = 0.1736.
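
The entries of the normal table can be reproduced with the error function, using the standard identity P(0 < Z < z) = erf(z/√2)/2 for z ≥ 0. The sketch below is only an illustration; the function name `table_value` is hypothetical.

```python
from math import erf, sqrt


def table_value(z: float) -> float:
    """P(0 < Z < z) for a standard normal Z and z >= 0."""
    return 0.5 * erf(z / sqrt(2))


# Row 0.3, column 0.02 and row 0.4, column 0.05 of the table.
entry_032 = table_value(0.32)
entry_045 = table_value(0.45)
```

Rounded to four decimals, these values match the two table entries read above.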


Example 7 What is the probability that a standard normal random variable Z is 
larger than 1? 

Note that P(0 < Z < 1) = 0.34. The total area under the normal curve is 1. So
by symmetry of the distribution P(Z > 0) = 0.5. Hence, 


P(Z > 1) =0.5—0.34 = 0.16. 


Example 8 What is the probability that a standard normal random variable Z is
larger than —1? 

It is always a good idea to sketch the graph of the normal curve and shade the 
area that we are looking for. 

By symmetry of the distribution of Z, P(Z > −1) = P(Z < 1). Note now
that P(Z < 1) = P(Z < 0) + P(0 < Z < 1) = 0.5 + 0.34 = 0.84. Hence,
P(Z > −1) = 0.84.


Example 9 Find c, so that P(Z < c) = 0.9. 
First note that c > 0 (why?). Then, 
P(Z < c) = P(Z < 0) + P(0 < Z < c).
Since P(Z < 0) = 0.5, we get P(0 < Z < c) = 0.4. We see from the table that c
is between 1.28 and 1.29. Since c is closer to 1.28, we take c = 1.28. 
Example 10 Find c, so that P(Z < c) = 0.2. 
First note that c < 0 (why?). By symmetry of the standard normal, 


P(Z <c)=P(Z>-c). 




Since P(Z > −c) = 0.5 − P(0 < Z < −c), we get P(0 < Z < −c) = 0.3. We
read in the table that —c is approximately 0.84. Thus, c = —0.84. 


Example 11 What is the probability that a standard normal random variable Z is 
between —2 and 2? 


\[ P(-2 < Z < 2) = P(-2 < Z < 0) + P(0 < Z < 2) = 2P(0 < Z < 2) \approx 0.95. \]

So there is only a 5% chance that a standard normal random variable is larger than 2 or
smaller than −2.
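
Examples 7 to 11 can be cross-checked without a table. The sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later), whose `cdf` method gives P(Z ≤ z) and whose `inv_cdf` method solves P(Z ≤ c) = given probability; the variable names are chosen here for illustration only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_larger_than_1 = 1 - Z.cdf(1)          # Example 7
p_larger_than_minus_1 = 1 - Z.cdf(-1)   # Example 8
c_for_090 = Z.inv_cdf(0.9)              # Example 9
c_for_020 = Z.inv_cdf(0.2)              # Example 10
p_between = Z.cdf(2) - Z.cdf(-2)        # Example 11
```

The values agree with the table-based answers up to the rounding used in the examples.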


3.2 Normal Approximation 


Consider the following histogram of a binomial B with n = 10 and p = 1/2. On 
the y axis we took the unit to be 1/1024. Hence, we read P(B = 4) = 210/1024. 
Observe that the histogram is bell shaped. This is why we are able to use a normal 
distribution to estimate binomial probabilities (Fig. 4.2). 


• Let B be a binomial random variable with parameters n and p. Let q = 1 − p. As
n increases, the distribution of $(B - np)/\sqrt{npq}$ approaches the distribution of a standard
normal random variable Z in the sense that for any a ≤ b,

\[ P(a \le B \le b) \approx P\left(\frac{a - np - 1/2}{\sqrt{npq}} \le Z \le \frac{b - np + 1/2}{\sqrt{npq}}\right). \]

We are using a continuous random variable Z to approximate a discrete random
variable B. This is why we enlarge the interval by 1/2 on both sides. This is
especially important if a = b or if $\sqrt{npq}$ is small. In practice, the approximation
yields good results when both np and nq are larger than 5.

Fig. 4.2 This is the histogram of a binomial with n = 10 and p = 1/2


Example 12 Roll a fair die 36 times. What is the probability that we get exactly six
6's?

Let B be the number of 6's we get in 36 rolls. Then B is a binomial random variable
with parameters n = 36 and p = 1/6. We first compute the exact probability.

\[ P(B = 6) = \binom{36}{6} (1/6)^6 (5/6)^{30} \approx 0.176. \]

We now use the normal approximation. Note that np = 6 and npq = 5.

\[ P(B = 6) = P(6 - 1/2 \le B \le 6 + 1/2) \]
\[ \approx P\left(\frac{6 - np - 1/2}{\sqrt{np(1-p)}} \le Z \le \frac{6 - np + 1/2}{\sqrt{np(1-p)}}\right) \]
\[ = P(-0.22 \le Z \le 0.22) \]
\[ = 2P(0 \le Z \le 0.22) \]
\[ = 0.174. \]

We see that the normal approximation is pretty good even for n not very large.
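
The comparison in Example 12 can be automated. The sketch below computes an exact binomial probability and its normal approximation with the 1/2 correction on both sides; the function names are hypothetical, and the standard normal distribution function is written with `math.erf`. Because the code does not round z to two decimals as the table requires, its approximate value (about 0.177) differs slightly from the 0.174 read from the table.

```python
from math import comb, erf, sqrt


def phi(x: float) -> float:
    """P(Z <= x) for a standard normal Z."""
    return 0.5 * (1 + erf(x / sqrt(2)))


def binomial_exact(a: int, b: int, n: int, p: float) -> float:
    """Exact P(a <= B <= b) for a binomial B with parameters n and p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(a, b + 1))


def binomial_normal_approx(a: int, b: int, n: int, p: float) -> float:
    """Normal approximation of P(a <= B <= b), interval enlarged by 1/2."""
    mu = n * p
    sd = sqrt(n * p * (1 - p))
    return phi((b - mu + 0.5) / sd) - phi((a - mu - 0.5) / sd)


exact_six = binomial_exact(6, 6, 36, 1 / 6)
approx_six = binomial_normal_approx(6, 6, 36, 1 / 6)
```

The same two functions can be used to explore how the agreement changes with n and p.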


Example 13 A hotel has accepted 210 reservations, but it has only 200 rooms. 
It is assumed that guests will actually show up independently of each other with 
probability 0.9. What is the probability that the hotel will not have enough rooms? 

Let B be the number of guests that will actually show up. This is a binomial 
random variable with parameters n = 210 and p = 0.9. The mean number of guests 
showing up is np = 189 and the variance is npq = 18.9. The normal approximation 
yields 


\[ P(201 \le B) \approx P\left(\frac{201 - 189 - 1/2}{\sqrt{18.9}} \le Z\right) = P(2.64 \le Z) \approx 0.004. \]

It is rather unlikely that not enough rooms will be available.


Example 14 Assume that a fair coin is tossed 10,000 times. Let B be the number of 
heads. What is the probability of getting exactly 5000 heads? 

The random variable B is a binomial with n = 10,000 and p = 1/2. We use the 
normal approximation to get 


\[ P(B = 5000) \approx P\left(\frac{5000 - np - 1/2}{\sqrt{npq}} \le Z \le \frac{5000 - np + 1/2}{\sqrt{npq}}\right). \]

The mean is np = 5000 and the standard deviation is $\sqrt{npq} = 50$. Thus,
\[ P(B = 5000) \approx P(-0.01 \le Z \le 0.01) \approx 0.008. \]


So the probability of getting exactly 5000 heads is rather slim: less than 1%. 
Note that the most likely number of heads is 5000 (see the problems). However, 
there are so many possible values that any fixed number of heads is rather unlikely. 


Example 15 Assume that a fair coin is tossed 10,000 times. Let B be the number of 
heads. Find a so that B is between 5000 — a and 5000 + a with probability 0.99. 
The expected value for B is np = 10,000 x 1/2 = 5000. We want a so that 
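\[ P(5000 - a \le B \le 5000 + a) = 0.99. \]
Using the normal approximation with np = 5000 and $\sqrt{npq} = 50$,

\[ P\left(\frac{-a - 1/2}{\sqrt{npq}} \le Z \le \frac{a + 1/2}{\sqrt{npq}}\right) \approx 0.99. \]
By the symmetry of the normal distribution,
\[ 2P\left(0 \le Z \le \frac{a + 1/2}{\sqrt{npq}}\right) \approx 0.99. \]

From the normal table, P(0 ≤ Z ≤ 2.57) ≈ 0.495, so $(a + 1/2)/\sqrt{npq} \approx 2.57$. Thus,

\[ a = 2.57\sqrt{npq} - 1/2 \approx 128. \]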


Hence, the number of heads will be in the interval [5000 — 128, 5000 + 128] with 
99% confidence. This is a rather narrow interval considering that we are performing 
10,000 tosses. 
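The computation of a in Example 15 can be sketched as follows. The function name `half_width` is illustrative; it uses `statistics.NormalDist` from the standard library to find the normal quantile (about 2.57 for probability 0.99) instead of a table.

```python
import math
from statistics import NormalDist

def half_width(n, p, prob):
    """Value a such that P(np - a <= B <= np + a) is about prob,
    using the normal approximation with continuity correction."""
    z = NormalDist().inv_cdf((1 + prob) / 2)   # P(-z < Z < z) = prob
    return z * math.sqrt(n * p * (1 - p)) - 0.5

a = half_width(10_000, 0.5, 0.99)
print(f"a = {a:.2f}")
```

The result is slightly above 128 because the quantile is not rounded to 2.57.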

The important lesson of this example is that the number of successes of a 
binomial with parameters n and p is in the interval 


\[
\left(np - (2.57\sqrt{npq} - 1/2),\; np + (2.57\sqrt{npq} - 1/2)\right)
\]

with probability 0.99. In particular, typical deviations from the mean np are of order \(\sqrt{npq}\).




Problems


11. Let Z be a standard normal random variable. Compute the following:

(a) P(Z > 1.52).
(b) P(Z > −1.15).
(c) P(−1 < Z < 2).
(d) P(−2 < Z < −1).

12. Let Z be a standard normal random variable. Find c so that P(Z > c) = 0.99.

13. We send out 500 invitations for an event. Assume that each person shows up independently of the others with probability 0.6.

(a) What is the probability that 280 or fewer people show up?
(b) Find b so that the number of people who show up is b or larger with probability 0.9.

14. Roll a fair die 360 times.

(a) What is the probability of getting exactly sixty 1's?
(b) Find a so that the number of 1's is in the interval [60 − a, 60 + a] with probability 95%.

15. Toss a fair coin 100 times.

(a) What is the probability of getting exactly 50 heads?
(b) Assume that 25 probability students toss a fair coin 100 times each. What is the probability that at least one student gets exactly 50 heads?

16. A gambler repeatedly bets $1 on red at roulette (there are 18 red slots and 38 slots in all). He wins $1 if red comes up and loses $1 otherwise. What is the probability that he will be ahead:

(a) After 100 bets?
(b) After 1000 bets?

17. Assume that each passenger shows up independently of the others with probability 0.95. How many tickets should the airline sell for a flight on an airplane with 200 seats so that there is no overbooking with probability 0.99?

18. Assume that 49 students each toss a fair coin 100 times.

(a) What is the probability that at least one student gets exactly 50 heads?
(b) What is the probability that at least 3 students get at least 60 heads?


4 The Negative Binomial


Example 16 Roll a fair die. What is the probability that the second 6 appears at the 
10th roll? 




The event A = “the second 6 appears at the 10th roll” is the intersection of two events, B = “there is exactly one 6 in the first 9 rolls” and C = “the 10th roll is a 6.” Note that B and C are independent. Thus,

P(A) = P(B)P(C).

The number of 6's in the first 9 rolls is a binomial with n = 9 and p = 1/6. Hence,

\[
P(A) = \binom{9}{1}(5/6)^8(1/6) \times (1/6) \approx 0.06.
\]


• We perform identical and independent trials, each trial having a probability of success equal to p. Let r be an integer larger than or equal to 1. Let \(B_r\) be the number of trials to get the r-th success. Then \(B_r\) is called a negative binomial random variable with parameters r and p.


It is easy to write a general formula for the distribution of a negative binomial 
random variable, see the problems. It is however better to work out the formula in 
each case as we do in the examples. 


Example 17 What is the probability that the fifth child of a couple is their second 
girl? 
The probability of exactly one girl among the first 4 children is

\[
\binom{4}{1}(1/2)(1/2)^3.
\]

The probability that the fifth child is a girl is 1/2. Hence, the probability that the fifth child of a couple is their second girl is

\[
\binom{4}{1}(1/2)(1/2)^3(1/2) = \frac{4}{2^5} = 1/8.
\]
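The reasoning used in Examples 16 and 17 (exactly r − 1 successes in the first k − 1 trials, then a success at trial k) can be sketched in code. The function name `negative_binomial_pmf` is illustrative and not from the text.

```python
import math

def negative_binomial_pmf(k, r, p):
    """P(the r-th success occurs at trial k): exactly r - 1 successes in the
    first k - 1 trials, followed by a success at trial k."""
    if k < r:
        return 0.0
    first_part = math.comb(k - 1, r - 1) * p**(r - 1) * (1 - p)**(k - r)
    return first_part * p

print(negative_binomial_pmf(10, 2, 1 / 6))  # Example 16, about 0.06
print(negative_binomial_pmf(5, 2, 1 / 2))   # Example 17, equal to 1/8
```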


We now turn to the mean and variance of a negative binomial random variable. 


• Let \(B_r\) be a negative binomial random variable with parameters r and p. Then,

\[
E(B_r) = \frac{r}{p} \quad\text{and}\quad \mathrm{Var}(B_r) = \frac{r(1-p)}{p^2}.
\]


We derive the formulas above by writing \(B_r\) as a sum of i.i.d. random variables. This is the same method we used to compute the expectation and variance of a binomial random variable. Let \(G_1\) be the number of trials until the first success; then \(G_1\) is a geometric random variable with parameter p. Let \(G_2\) be the number of trials from the first success until the second success. The random variable \(G_2\) is also geometric and \(G_1\) and \(G_2\) are independent. More generally, we define \(G_i\) to be the number of trials between the (i − 1)th success and the ith success for i = 1, 2, ..., r. The number of trials up to the rth success is then

\[
B_r = G_1 + G_2 + \cdots + G_r.
\]

All the \(G_i\) are independent and identically distributed according to a geometric with parameter p. Recall that \(E(G_1) = 1/p\). Thus,

\[
E(B_r) = rE(G_1) = \frac{r}{p}.
\]

By using the independence of the \(G_i\) and the fact that \(\mathrm{Var}(G_1) = (1-p)/p^2\),

\[
\mathrm{Var}(B_r) = r\,\mathrm{Var}(G_1) = \frac{r(1-p)}{p^2}.
\]
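The decomposition of a negative binomial into a sum of independent geometric random variables can be checked by simulation. This is a rough sketch under assumed values (r = 2, p = 1/6, a fixed seed, 50,000 runs); the helper names are illustrative.

```python
import random

def geometric(p, rng):
    """Number of trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:   # failure with probability 1 - p
        trials += 1
    return trials

def negative_binomial(r, p, rng):
    """Sum of r independent geometric random variables with parameter p."""
    return sum(geometric(p, rng) for _ in range(r))

rng = random.Random(0)
r, p, runs = 2, 1 / 6, 50_000
samples = [negative_binomial(r, p, rng) for _ in range(runs)]
sample_mean = sum(samples) / runs
sample_var = sum((x - sample_mean) ** 2 for x in samples) / runs
print(sample_mean, r / p)                   # compare with r/p = 12
print(sample_var, r * (1 - p) / p**2)       # compare with r(1-p)/p^2 = 60
```

The sample mean and variance should be close to, but not exactly equal to, the theoretical values.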


Problems 


19. Toss a fair coin. What is the probability that the third head occurs at the 8th 
toss? 

20. What is the probability that the fifth child of a couple is their third girl? 

21. Roll two dice. 


(a) What is the probability of getting the second pair of 6’s at the 10th roll? 
(b) What is the expected number of rolls to get the second pair of 6’s? 


22. Roll a die.


(a) What is the probability of getting the first 6 at or before the fifth roll? 

(b) What is the probability of getting the third 6 at the 10th roll? 

(d) Given that the second 6 occurred at the 10th roll, what is the probability 
that the first 6 occurred at the 5th roll? 


23. Items are examined sequentially at a manufacturing plant. The probability that 
an item is defective is 0.05. What is the expected number of examined items 
until we get the fifth defective? 

24. In this problem we find the distribution of \(B_r\), a negative binomial random variable with parameters r and p.


(a) Show that the distribution of \(B_r\) is given by

\[
P(B_r = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r} \quad\text{for } k = r, r+1, \ldots
\]


(b) Let r = 1 in the formula in (a). What is the other name of this distribution? 


Chapter 5
Poisson Random Variables


1 Poisson Probability Distribution 


We start with the definition. 


• The random variable N is said to have a Poisson distribution with mean λ > 0 if

\[
P(N = k) = e^{-\lambda}\frac{\lambda^k}{k!} \quad\text{for } k = 0, 1, \ldots
\]

At the end of this chapter we will show that a random variable with the distribution above has indeed mean λ.

Typically the Poisson distribution appears when we count the number of 
occurrences of events that have small probabilities and are independent of each 
other. 


Example 1 Consider a fire station that serves a given neighborhood. Each resident has a small probability of needing help on a given day, and most of the time people need help independently of each other. The number of calls a fire station gets on a given day may be approximated by a Poisson random variable with mean λ. The parameter λ may be taken to be the observed average. Assume that λ = 6 calls per day. What is the probability that a fire station gets 2 or more calls in a given day?

\[
\begin{aligned}
P(N \geq 2) &= 1 - P(N < 2) \\
&= 1 - P(N = 0) - P(N = 1) \\
&= 1 - e^{-\lambda} - \lambda e^{-\lambda} \\
&= 1 - e^{-\lambda}(1 + \lambda) \\
&= 1 - 7e^{-6} \approx 0.98.
\end{aligned}
\]
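The computation in Example 1 can be reproduced with a few lines of code. This is a minimal sketch; the function names `poisson_pmf` and `poisson_tail` are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson random variable N with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_tail(k_min, lam):
    """P(N >= k_min) = 1 - P(N = 0) - ... - P(N = k_min - 1)."""
    return 1 - sum(poisson_pmf(k, lam) for k in range(k_min))

print(round(poisson_tail(2, 6), 2))  # Example 1: 1 - 7 e^{-6}, about 0.98
```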
© Springer Nature Switzerland AG 2022


R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_5 




Example 2 Assume that a book has an average of 1 misprint every 10 pages. What 
is the probability that a given page has no misprint? 

Considering all the words in a given page, we may assume that each one of them 
has a small probability of being misprinted. We may also assume that each word is 
misprinted independently of the other words. With these assumptions the Poisson 
distribution is adequate. The mean number of misprints per page is λ = 0.1. Thus,

\[
P(N = 0) = e^{-\lambda} = e^{-0.1} \approx 0.9.
\]


2 Poisson Scatter Theorem 


We now turn to a property that shows that the Poisson distribution is bound to appear 
in many situations. Consider a finite interval I that gets random hits (the interval may represent a time interval and the hits may represent incoming telephone calls). Assume the following two hypotheses:


(1) A given point of I may get hit at most once.
(2) Divide I into equal subintervals; then each subinterval gets hit with the same probability and independently of the other subintervals.


• Poisson scatter theorem. Under hypotheses (1) and (2) there is a number λ > 0 such that the total number of hits on I follows a Poisson distribution with mean λ. Let L be the length of I; then the number of hits on any subinterval of I with length ℓ has a Poisson distribution with mean λℓ/L.


For the ideas in the proof of this theorem, see Probability by J. Pitman. 


Example 3 Consider a telephone exchange on a Monday from 2:00 to 3:00 p.m. 
Assume that there is an average of 120 calls during this time period. What is the 
probability of getting at least 4 calls in a 3 min interval? 

It may be reasonable to assume that hypotheses (1) and (2) hold (the only 
question about this is whether each subinterval of time is equally likely to get calls). 
Then according to the Poisson scatter theorem the number of calls during a 3 min 
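interval follows a Poisson distribution with mean 120 × 3/60 = 6. Thus,

\[
\begin{aligned}
P(N \geq 4) &= 1 - \left(P(N = 0) + P(N = 1) + P(N = 2) + P(N = 3)\right) \\
&= 1 - e^{-6}\left(1 + 6 + \frac{6^2}{2!} + \frac{6^3}{3!}\right) \\
&\approx 0.85.
\end{aligned}
\]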


The Poisson scatter theorem holds in any dimension. For instance, it may be used 
to count the number of stars that appear on a photographic plate or the number of 




raisins in a cookie. In the first case we replace length by area, and in the second one 
we replace length by volume. 


Example 4 Assume that rain drops are hitting a square with side 10 in. Assume that 
the average is 30 drops per minute. What is the probability that a sub-square with 
side 2 in. does not get hit in a given minute? 

Again it seems reasonable to assume that hypotheses (1) and (2) hold. The 
number of rain drops in the sub-square follows a Poisson distribution with mean \(30 \times 2^2/10^2 = 1.2\). Thus,

\[
P(N = 0) = e^{-1.2} \approx 0.3.
\]
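The scaling of the mean used in Examples 3 and 4 (length, area, or volume of the sub-region divided by that of the whole region) can be sketched as follows. The helper `sub_region_mean` is a hypothetical name introduced for this illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson random variable N with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def sub_region_mean(total_mean, sub_size, total_size):
    """Mean number of hits on a sub-region, proportional to its size."""
    return total_mean * sub_size / total_size

# Example 3: 120 calls per 60 min, window of 3 min.
lam_calls = sub_region_mean(120, 3, 60)
p_at_least_4 = 1 - sum(poisson_pmf(k, lam_calls) for k in range(4))

# Example 4: 30 drops per minute on a 10 in. square, sub-square of side 2 in.
lam_drops = sub_region_mean(30, 2**2, 10**2)
p_no_drop = poisson_pmf(0, lam_drops)

print(lam_calls, round(p_at_least_4, 2))   # mean 6, probability about 0.85
print(lam_drops, round(p_no_drop, 2))      # mean 1.2, probability about 0.3
```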


3 Poisson Approximation to the Binomial 


• Let B be a binomial random variable with parameters n and p. Let N be a Poisson random variable with mean λ = np. Then for every k ≥ 0,

\[
P(B = k) \approx P(N = k) \quad\text{for small } p.
\]

We have seen in the previous chapter that a binomial can be approximated by a normal distribution. The normal approximation is valid when n is large enough so that np ≥ 5 and n(1 − p) ≥ 5. In contrast, the Poisson approximation works best when p is small.

Thanks to the Poisson approximation, we replace a two-parameter distribution by a one-parameter distribution and we avoid the computation of binomial coefficients. Note that if B is a binomial random variable with parameters n and p, then

\[
P(B = 0) = (1 - p)^n.
\]

Recall from Calculus that

\[
\lim_{p \to 0} \frac{\ln(1 - p)}{-p} = 1.
\]

Therefore,

\[
\begin{aligned}
P(B = 0) &= (1 - p)^n \\
&= \exp(n \ln(1 - p)) \\
&\approx \exp(-np) \\
&= P(N = 0),
\end{aligned}
\]




where the approximation holds for p small enough. In order to prove that a binomial with small p may be approximated by a Poisson, we need to show that for every k ≥ 0 it is true that P(B = k) ≈ P(N = k). This will be proved in the Coupling chapter.


Example 5 During a recent meteor shower, it was estimated that the probability that a given satellite is hit by a meteor is 1/1000. Assuming that there are 500 satellites around the Earth and that they get hit independently of each other, what is the probability that no satellite will be hit?

Let B be the number of satellites hit. Under these assumptions B has a binomial 
distribution with parameters n = 500 and p = 1/1000. Thus, 


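\[
P(B = 0) = (1 - 1/1000)^{500} \approx 0.6064.
\]

We now use the Poisson approximation with λ = np = 1/2. We have

\[
P(N = 0) = e^{-\lambda} = e^{-1/2} \approx 0.6065.
\]

One can see that the approximation is excellent in this case.

What is the probability that 3 or more satellites are hit? This time we want

\[
\begin{aligned}
P(N \geq 3) &= 1 - P(N = 0) - P(N = 1) - P(N = 2) \\
&= 1 - e^{-\lambda} - \lambda e^{-\lambda} - \frac{\lambda^2}{2} e^{-\lambda} \\
&= 1 - e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2}\right).
\end{aligned}
\]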
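To see how close the two distributions are in Example 5, one can evaluate both side by side. The sketch below uses only the standard library; the function names are illustrative.

```python
import math

n, p = 500, 1 / 1000
lam = n * p   # Poisson mean, equal to 1/2

def binomial_pmf(k):
    """Exact P(B = k) for the binomial with parameters n and p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    """P(N = k) for the Poisson with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(3):
    print(k, round(binomial_pmf(k), 4), round(poisson_pmf(k), 4))

# Probability that 3 or more satellites are hit, under each model.
binomial_tail = 1 - sum(binomial_pmf(k) for k in range(3))
poisson_tail = 1 - sum(poisson_pmf(k) for k in range(3))
print(round(binomial_tail, 4), round(poisson_tail, 4))
```

Under both models the probability of 3 or more hits comes out to roughly 0.014.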


4 Approximation to a Sum of Binomials 


In many situations we may need more involved models than a binomial. For 
instance, in the case of cancer the probability of getting the disease increases 
significantly with age. So a more realistic model should separate people in age 
classes. The total number of cancer cases is then a sum of binomial random variables 
with different p’s. This is not a binomial random variable. However, the next result 
shows that we may still use the Poisson approximation when all the p’s are small. 


• Let \(B_i\), for i = 1, ..., r, be independent binomial random variables with parameters \(n_i\) and \(p_i\). Let

\[
\lambda = n_1 p_1 + \cdots + n_r p_r.
\]

Then, \(B_1 + B_2 + \cdots + B_r\) can be approximated by a Poisson random variable with mean λ, provided all the \(p_i\) are small.


Example 6 Assume that a hospital serves 100,000 people that are in 3 different age classes. Assume that an individual in class i has a probability \(p_i\) (independently of all the other individuals) of getting a certain disease. Class 1 has \(n_1 = 50{,}000\) individuals and \(p_1 = 2 \times 10^{-5}\), class 2 has \(n_2 = 30{,}000\) individuals and \(p_2 = 5 \times 10^{-5}\), and class 3 has \(n_3 = 20{,}000\) individuals and \(p_3 = 10^{-4}\). What is the probability that in a given year this hospital sees 3 or more cases of the disease?

For each class i the number of cases \(B_i\) follows a binomial with parameters \(n_i\) and \(p_i\). We are interested in the event \(B_1 + B_2 + B_3 \geq 3\). Since the \(B_i\) are independent and the \(p_i\)'s are small, we may use the Poisson approximation. Let

\[
\lambda = n_1 p_1 + n_2 p_2 + n_3 p_3 = 4.5.
\]

Let N be a Poisson random variable with mean λ. We have

\[
\begin{aligned}
P(B_1 + B_2 + B_3 \geq 3) &\approx P(N \geq 3) \\
&= 1 - \left(P(N = 0) + P(N = 1) + P(N = 2)\right) \\
&= 1 - e^{-\lambda}\left(1 + \lambda + \frac{\lambda^2}{2}\right) \\
&\approx 0.83.
\end{aligned}
\]
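The arithmetic of Example 6 can be written out term by term. This is only a restatement of the numbers given in the example:

\[
\lambda = 50{,}000 \times 2 \times 10^{-5} + 30{,}000 \times 5 \times 10^{-5} + 20{,}000 \times 10^{-4} = 1 + 1.5 + 2 = 4.5,
\]

\[
P(N \geq 3) = 1 - e^{-4.5}\left(1 + 4.5 + \frac{4.5^2}{2}\right) = 1 - 15.625\, e^{-4.5} \approx 0.83.
\]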


Problems 


1. Suppose that cranberry muffins have an average of 6 cranberries.


(a) What is the probability that a muffin has exactly 4 cranberries?
(b) What is the probability that half a muffin has 2 or fewer cranberries?


2. Assume that you bet 200 times on 7 at the roulette (there are 38 possible slots). 
What is the probability that you win exactly 3 times? 

3. Assume that 1000 individuals are screened for a condition that affects 1% of 
the general population. What is the probability that exactly 10 individuals have 
the condition? 

4. Assume that an elementary school has 500 children. 


(a) What is the probability that exactly one child was born on April 15? 
(b) What is the probability that at least 3 children were born on April 15? 


5. The number of incoming phone calls at a telephone exchange is modeled using a Poisson distribution with mean λ = 2 per minute.

(a) What is the probability of having no calls during a given minute?
(b) What is the probability of having 5 or fewer calls in a 3 min interval?

6. Suppose that the probability of a genetic disorder is 0.05 for men and 0.01 for women. Assume that 50 men and 100 women are screened. Compute the probability that exactly two individuals among the 150 who have been screened have the disorder.

7. Assume that 1% of men under 20 experience hair loss and that 10% of men over 30 experience hair loss. A sample of 20 men under 20 and 30 men over 30 is examined. What is the probability that 4 or more men experience hair loss?

8. Let N be a Poisson random variable with mean λ. If λ is large enough, the Poisson distribution can be approximated by a standard normal. That is,

\[
\frac{N - \lambda}{\sqrt{\lambda}} \approx Z \quad\text{as } \lambda \to +\infty.
\]

(a) Assume that λ = 10. What is the exact probability that N = 10?
(b) Use the normal approximation to compute the probability in (a).

9. A company has 3 factories A, B, and C. A has manufactured 1000 items, B has manufactured 1500 items, and C has manufactured 2000 items. Assume that the probability that an item is defective is 0.003 for A, 0.002 for B, and 0.001 for C. What is the probability that the total number of defective items is 7 or larger?

10. On average there is one defect per 100 meters of magnetic tape.

(a) What is the probability that 150 m of tape have no defect?
(b) Given that the first 100 m of tape had no defect, what is the probability that the whole 150 m have no defect?
(c) Given that the first 100 m of tape had at least one defect, what is the probability that the whole 150 m have exactly 2 defects?

11. Assume you bet $1 100 times on 7 (there are 38 equally likely slots). If 7 comes up, you win $35; otherwise you lose your $1.

(a) What are your expected winnings?
(b) What is the probability that you are ahead after 100 bets?
(c) What is the probability that you have lost $100?

12. Assume that on average there are 5 raisins per cookie.

(a) What is the probability that in a package of 10 cookies all the cookies have at least one raisin?
(b) How many raisins should each cookie have on average so that the probability in (a) is 0.99?




13. Assume that books from a certain publisher have an average of one misprint in 
every 20 pages. 


(a) What is the probability that a given page has two or more misprints? 
(b) What is the probability that a book of 200 pages has at least one page with 
two or more misprints? 


14. The number of incoming phone calls at a telephone exchange is modeled using a Poisson distribution with mean λ calls per minute. Show that, given that there were n calls during the first t minutes, the number of calls during the first s < t minutes follows a binomial with parameters n and p = s/t.


5 Mean and Variance 


Recall from Calculus that

\[
e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} \quad\text{for every real number } x.
\]

In particular we see that if N is a Poisson random variable with mean λ, we have

\[
\sum_{k=0}^{\infty} P(N = k) = \sum_{k=0}^{\infty} e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.
\]

This shows that the Poisson distribution is indeed a probability distribution.


• Let N be a Poisson random variable with mean λ. Then,

\[
E(N) = \lambda \quad\text{and}\quad \mathrm{Var}(N) = \lambda.
\]


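We start with the expectation.

\[
E(N) = \sum_{k=0}^{\infty} k P(N = k) = \sum_{k=1}^{\infty} k e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}.
\]

By shifting the summation index we get

\[
\sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda}.
\]

Thus,

\[
E(N) = e^{-\lambda}\lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = e^{-\lambda}\lambda e^{\lambda} = \lambda.
\]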


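We now turn to the computation of the variance of N. For the Poisson distribution, \(E(N^2)\) is obtained by first computing E(N(N − 1)). Note that N(N − 1) = 0 for N = 0 and for N = 1. Thus,

\[
E(N(N - 1)) = \sum_{k=2}^{\infty} k(k-1) e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\lambda^2 \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!}.
\]

We shift the summation index to get

\[
\sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda}.
\]

Therefore,

\[
E(N(N - 1)) = e^{-\lambda}\lambda^2 e^{\lambda} = \lambda^2.
\]

By the linearity of the expectation,

\[
E(N(N - 1)) = E(N^2 - N) = E(N^2) - E(N).
\]

We solve for \(E(N^2)\):

\[
E(N^2) = E(N(N - 1)) + E(N) = \lambda^2 + \lambda.
\]

Hence,

\[
\mathrm{Var}(N) = E(N^2) - E(N)^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.
\]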
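The formulas for the mean and variance can be checked numerically by truncating the series. This is a sketch only: the function name `poisson_moments` and the truncation at 200 terms are assumptions made for the illustration, adequate for moderate values of λ.

```python
import math

def poisson_moments(lam, terms=200):
    """Approximate E(N) and Var(N) by truncating the defining series."""
    pmf = math.exp(-lam)               # P(N = 0)
    mean = 0.0
    factorial_moment = 0.0             # running value of E(N(N - 1))
    for k in range(1, terms):
        pmf *= lam / k                 # P(N = k) obtained from P(N = k - 1)
        mean += k * pmf
        factorial_moment += k * (k - 1) * pmf
    variance = factorial_moment + mean - mean**2
    return mean, variance

mean, variance = poisson_moments(4.5)
print(mean, variance)   # both should be very close to 4.5
```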


Chapter 6
Simulations of Discrete Random Variables


1 Random Numbers 


It is one thing to know that a Bernoulli random variable is 1 with probability p and 0 with probability 1 − p. It is something else to generate a million Bernoulli random variables and to look at them. One can see what randomness really looks like. We are more often than not surprised. Human intuition on randomness is not good, and simulations are very helpful in correcting false ideas.

A random generator is a computer program that generates random numbers. These random numbers are not random at all: they are produced by a deterministic algorithm. However, a good random generator will produce numbers that mimic randomness enough for most purposes.

Random numbers are distributed according to a uniform distribution on the interval (0, 1). A uniform random variable U has a flat continuous distribution.


This distribution will be studied in more detail in the chapter on continuous random variables. The only property we need in this chapter is that for any 0 < p < 1,


P(0 < U < p) = p.


© Springer Nature Switzerland AG 2022
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_6 


2 Bernoulli Random Variables 


To simulate a Bernoulli random variable with parameter p we use the following 
algorithm: 


• We generate a random number U (i.e., a random variable uniformly distributed on (0, 1)).
• If U < p, let X = 1, and if not let X = 0.


The random variable X is a Bernoulli random variable with parameter p. This is so because


P(X = 1) = P(U < p) = p

and

P(X = 0) = P(U ≥ p) = 1 − p.
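The algorithm above can be sketched in a few lines. This is an illustration, not the generator used in the book: it relies on Python's `random` module, whose `random()` method returns numbers in [0, 1) rather than (0, 1), a difference that does not matter here. The seed and the number of draws are arbitrary choices.

```python
import random

def bernoulli(p, rng):
    """Simulate one Bernoulli random variable with parameter p."""
    u = rng.random()          # random number U
    return 1 if u < p else 0  # X = 1 when U < p, X = 0 otherwise

rng = random.Random(2022)     # fixed seed so that the run is reproducible
draws = [bernoulli(1 / 6, rng) for _ in range(60_000)]
frequency = sum(draws) / len(draws)
print(frequency)              # should be close to 1/6
```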


We now apply this algorithm. 


Example 1 We use the algorithm above to simulate two different Bernoulli random variables.


Random numbers      0.814  0.905  0.127  0.913  0.632  0.197  0.278
Bernoulli p = 1/2   0      0      1      0      0      1      1
Bernoulli p = 1/6   0      0      1      0      0      0      0


For instance, the random number 0.197 < 1/2 and 0.197 > 1/6. Hence, the Bernoulli for p = 1/2 is 1 and the Bernoulli for p = 1/6 is 0.

We can use the Bernoulli distribution with p = 1/2 to simulate 7 tosses of a fair coin. The first row of the simulation gives us the following run: TTHTTHH. A common mistake is to believe that tosses of a fair coin will more or less alternate between H and T. It is true that the fraction of H's will tend to 1/2 as the number of tosses increases. However, there will be long runs of only H's and of only T's. For instance, if we toss the coin 200 times, we should expect the longest run of H's or T's to be between 5 and 8 (see Schilling (1990), The College Mathematics Journal 21, pages 196-207).

We can use the Bernoulli distribution with p = 1/6 to simulate the roll of a die. A 1 corresponds to rolling a 6, and a 0 to rolling anything but a 6. Since 0.127 is the only random number smaller than 1/6, we see in the second row of the table above that we roll a 6 only at the third roll.


3 Binomial Random Variables 


Example 2 Consider the following 10 × 5 matrix of random numbers.


0.1576 0.6557 0.7060 0.4387 0.2760 
0.9706 0.0357 0.0318 0.3816 0.6797 
0.9572 0.8491 0.2769 0.7655 0.6551 
0.4854 0.9340 0.0462 0.7952 0.1626 
0.8003 0.6787 0.0971 0.1869 0.1190 
0.1419 0.7577 0.8235 0.4898 0.4984 
0.4218 0.7431 0.6948 0.4456 0.9597 
0.9157 0.3922 0.3171 0.6463 0.3404 
0.7922 0.6555 0.9502 0.7094 0.5853 
0.9595 0.1712 0.0344 0.7547 0.2238 


Assuming that our random generator is generating independent identically distributed (i.i.d.) uniform random variables, we apply the algorithm of the last section to transform this matrix into a matrix of i.i.d. Bernoulli random variables with any fixed p in (0, 1). Let p = 1/2. We get the following matrix of i.i.d. Bernoulli random variables with p = 1/2.


1 0 0 1 1
0 1 1 1 0
0 0 1 0 0
1 0 1 0 1
0 0 1 1 1
1 0 0 1 1
1 0 0 1 0
0 1 1 0 1
0 0 0 0 0
0 1 1 0 1


Recall now that a sum of n i.i.d. Bernoulli random variables with parameter p has a binomial distribution with parameters n and p. By summing every row of the matrix above, we get 10 simulations of a binomial with parameters n = 5 and p = 1/2. They are: 3, 3, 1, 3, 3, 3, 2, 3, 0, 3.
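The transformation of Example 2 can be reproduced directly from the matrix of random numbers. The sketch below copies the matrix from the example; the function name `to_bernoulli` is illustrative.

```python
random_matrix = [
    [0.1576, 0.6557, 0.7060, 0.4387, 0.2760],
    [0.9706, 0.0357, 0.0318, 0.3816, 0.6797],
    [0.9572, 0.8491, 0.2769, 0.7655, 0.6551],
    [0.4854, 0.9340, 0.0462, 0.7952, 0.1626],
    [0.8003, 0.6787, 0.0971, 0.1869, 0.1190],
    [0.1419, 0.7577, 0.8235, 0.4898, 0.4984],
    [0.4218, 0.7431, 0.6948, 0.4456, 0.9597],
    [0.9157, 0.3922, 0.3171, 0.6463, 0.3404],
    [0.7922, 0.6555, 0.9502, 0.7094, 0.5853],
    [0.9595, 0.1712, 0.0344, 0.7547, 0.2238],
]

def to_bernoulli(matrix, p):
    """Replace each random number u by 1 if u < p and by 0 otherwise."""
    return [[1 if u < p else 0 for u in row] for row in matrix]

bernoulli_matrix = to_bernoulli(random_matrix, 1 / 2)
# Each row sum is one simulation of a binomial with n = 5 and p = 1/2.
binomial_samples = [sum(row) for row in bernoulli_matrix]
print(binomial_samples)  # [3, 3, 1, 3, 3, 3, 2, 3, 0, 3]
```

Changing the second argument of `to_bernoulli` gives simulations for other values of p from the same random numbers.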


3.1 Computational Formula for the Binomial Distribution 


In order to compare our simulations to the exact distribution of a binomial, it is helpful to have a method to compute the exact distribution without having to compute factorials. Factorials grow very fast, and their computation will eventually overwhelm any computer.


• Let B be a binomial random variable with parameters n and p. Then,

\[
P(B = 0) = (1 - p)^n
\]

and

\[
P(B = k) = \frac{p}{1-p}\,\frac{n-k+1}{k}\,P(B = k - 1) \quad\text{for } k = 1, 2, \ldots, n.
\]


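We now derive the computational formula. Recall that for 0 ≤ k ≤ n,

\[
P(B = k) = \binom{n}{k} p^k (1 - p)^{n-k},
\]

where

\[
\binom{n}{k} = \frac{n!}{k!(n-k)!}.
\]

Let k ≥ 1. Observe that

\[
p^k (1 - p)^{n-k} = \frac{p}{1-p}\, p^{k-1} (1 - p)^{n-k+1}.
\]

Using that k! = k(k − 1)! and that (n − k)! = (n − k + 1)!/(n − k + 1),

\[
\frac{n!}{k!(n-k)!} = \frac{n-k+1}{k}\,\frac{n!}{(k-1)!(n-k+1)!} = \frac{n-k+1}{k}\binom{n}{k-1}.
\]

We get

\[
\begin{aligned}
P(B = k) &= \frac{p}{1-p}\,\frac{n-k+1}{k}\binom{n}{k-1} p^{k-1} (1 - p)^{n-k+1} \\
&= \frac{p}{1-p}\,\frac{n-k+1}{k}\,P(B = k - 1).
\end{aligned}
\]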


We now apply the preceding formula to an example. 


Example 3 Find the distribution of a binomial random variable with n = 20 and p = 0.2.

We have that

\[
P(B = 0) = (1 - p)^n = (0.8)^{20} \approx 0.0115.
\]

We use the recursion for k = 1, and note that p/(1 − p) = 1/4,

\[
P(B = 1) = \frac{1}{4} \times 20 \times P(B = 0) \approx 0.0576.
\]

For k = 2 we get

\[
P(B = 2) = \frac{1}{4} \times \frac{19}{2} \times P(B = 1) \approx 0.1369.
\]

We go on like this to get all the P(B = k) for 0 ≤ k ≤ 20. We summarize the distribution in the table below.


k          0       1       2       3       4       5       6       7
P(B = k)   0.0115  0.0576  0.1369  0.2054  0.2182  0.1746  0.1091  0.0545
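The recursion for the binomial distribution translates into a short loop with no factorials. This is a minimal sketch; the function name `binomial_distribution` is illustrative.

```python
def binomial_distribution(n, p):
    """Return [P(B = 0), ..., P(B = n)] using the computational formula."""
    probs = [(1 - p) ** n]                  # P(B = 0)
    ratio = p / (1 - p)
    for k in range(1, n + 1):
        # P(B = k) = p/(1-p) * (n-k+1)/k * P(B = k-1)
        probs.append(ratio * (n - k + 1) / k * probs[-1])
    return probs

probs = binomial_distribution(20, 0.2)
print([round(x, 4) for x in probs[:8]])     # compare with the table for n = 20, p = 0.2
```

The list has n + 1 entries, and they should sum to 1 up to rounding error.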


We only record the probability values up to k = 7. For k > 7 the probabilities 
get very small. We now simulate this distribution. 


k        0       1       2       3       4       5       6       7
Sim. 1   0.0150  0.0540  0.1590  0.1830  0.2140  0.1670  0.1100  0.0630
Sim. 2   0.0114  0.0589  0.1392  0.2023  0.2185  0.1741  0.1103  0.0525


Simulation 1 was done by simulating 1000 binomial random variables with 
parameters n = 20 and p = 0.2. We then computed the observed frequencies for 
all k between 0 and 20. For instance, we observed 15 zeros among the 1000 runs. 
Hence, the observed frequency is 15/1000 = 0.0150. 

Simulation 2 used 10,000 binomial random variables. The more random vari- 
ables we use the closer the simulation is to the exact distribution. 
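A simulation of this kind can be sketched as follows. The seed, the helper name `simulate_binomial`, and the use of Python's `random` module are assumptions of this sketch, so the frequencies it produces will differ from those in the table.

```python
import random
from collections import Counter

def simulate_binomial(n, p, rng):
    """One binomial sample: the number of successes among n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(1)      # fixed seed for reproducibility
runs = 10_000
counts = Counter(simulate_binomial(20, 0.2, rng) for _ in range(runs))
# Observed frequency of each value k between 0 and 20.
frequencies = [counts[k] / runs for k in range(21)]
print([round(f, 4) for f in frequencies[:8]])
```

Increasing `runs` should bring the observed frequencies closer to the exact distribution, at the cost of a longer running time.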


4 Poisson Random Variables 


Let N be a Poisson random variable with mean λ. Recall that the distribution of N is given by

\[
P(N = k) = e^{-\lambda}\frac{\lambda^k}{k!} \quad\text{for } k = 0, 1, \ldots
\]

In order to avoid factorials we may use the following computational formula for the distribution of N.

\[
P(N = 0) = e^{-\lambda}, \qquad P(N = k) = \frac{\lambda}{k} P(N = k - 1) \quad\text{for all } k \geq 1.
\]

This formula is easy to derive, see the problems. We now apply it in the case λ = 1.


k 0 1 2 3 4 5 
P(N =k) 0.3679 0.3679 0.1839 0.0613 0.0153 0.0031 


The support of N is infinite, but for k ≥ 6 the probabilities start being very small and we omit them.
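The computational formula for the Poisson distribution can be sketched in the same way as for the binomial. The function name `poisson_distribution` is illustrative.

```python
import math

def poisson_distribution(lam, k_max):
    """Return [P(N = 0), ..., P(N = k_max)] without computing factorials."""
    probs = [math.exp(-lam)]                # P(N = 0)
    for k in range(1, k_max + 1):
        probs.append(lam / k * probs[-1])   # P(N = k) = (lam / k) P(N = k - 1)
    return probs

print([round(x, 4) for x in poisson_distribution(1, 5)])
# [0.3679, 0.3679, 0.1839, 0.0613, 0.0153, 0.0031]
```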

We now turn to the simulation of a Poisson random variable. Recall that a binomial with parameters n and p can be approximated by a Poisson with mean λ = np. The smaller the p the better the approximation. Since we have an efficient method to simulate a binomial random variable, we use it to simulate a Poisson random variable.


k              0       1       2       3       4       5
Simulation 1   0.3493  0.3852  0.1925  0.0613  0.0095  0.0019
Simulation 2   0.3635  0.3641  0.1919  0.0631  0.0136  0.0032


In simulation 1 we run 10,000 binomial random variables with n = 10 and p = 0.1. In simulation 2 we run the same number of binomial random variables with n = 100 and p = 0.01. Hence, in both cases np = 1. As predicted, simulation 2 is more accurate due to the smaller p.
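The method described above, simulating a Poisson random variable through a binomial with small p, can be sketched as follows. The seed and the helper name `simulate_poisson` are assumptions of this sketch, so its output will not match the table exactly.

```python
import math
import random
from collections import Counter

def simulate_poisson(lam, n, rng):
    """Approximate Poisson sample: a binomial with parameters n and p = lam / n."""
    p = lam / n
    return sum(1 for _ in range(n) if rng.random() < p)

rng = random.Random(7)      # fixed seed for reproducibility
runs = 10_000
counts = Counter(simulate_poisson(1.0, 100, rng) for _ in range(runs))
frequencies = [counts[k] / runs for k in range(6)]
print([round(f, 4) for f in frequencies])
print(round(math.exp(-1), 4))   # exact P(N = 0) for mean 1, for comparison
```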


Problems 


1. Use the matrix of random numbers in Example 2 to simulate 10 binomial 
random variables with n = 5 and p = 1/6. 

2. Use the matrix of random numbers in Example 2 to simulate 5 binomial random 
variables with n = 10 and p = 1/4. 

3. Use the following algorithm to simulate the roll of a die. Let U be a random number.
If U < 1/6, let X = 1.
If 1/6 < U < 2/6, let X = 2, and so on.


(a) Finish writing the algorithm we started above.
(b) Explain why this algorithm simulates the roll of a fair die.
(c) Use the 7 random numbers of Example 1 to simulate 7 rolls of a fair die.


4. Use the 7 random numbers of Example 1 to simulate a geometric random variable with p = 0.2. (Use the random numbers to get Bernoulli random variables with the appropriate p and recall that a geometric is the number of trials to achieve the first success.)

5. (a) Use the matrix of random numbers in Example 2 to simulate 10 Poisson random variables with λ = 0.5.
(b) Compare the simulation in (a) to the exact distribution of a Poisson with λ = 0.5.

6. Prove the computational formula for the Poisson distribution. That is, show that

\[
P(N = k) = \frac{\lambda}{k} P(N = k - 1) \quad\text{for } k \geq 1.
\]

7. Use Problem 6 to compute the first 10 terms of the Poisson distribution with λ = 4.

8. Recall that a mode M (not necessarily unique) of a discrete random variable B is such that P(B = M) is the maximum of all the P(B = k). In this problem we derive a formula for the mode of a binomial. Let B be a binomial with parameters n and p.


(a) By using the computational formula for the binomial distribution, show that

\[
P(B = k) > P(B = k - 1)
\]

if and only if k < np + p.
(b) Use (a) to show that the sequence P(B = k) is increasing for k < np + p and decreasing for k > np + p.
(c) Use (b) to show that the maximum of the sequence P(B = k) is attained at k = ⌊np + p⌋, where ⌊x⌋ is the largest integer smaller than x.
(d) What is the mode of a binomial with n = 30 and p = 1/6?


9. In this problem we find a formula for the mode of a Poisson distribution.


(a) Use that

\[
P(N = k) = \frac{\lambda}{k} P(N = k - 1) \quad\text{for } k \geq 1
\]

to show that P(N = k) > P(N = k − 1) if and only if λ > k.
(b) Show that the sequence P(N = k) attains a maximum for k = ⌊λ⌋ (i.e., the largest integer smaller than λ).
(c) Show that if λ is an integer, then there are two modes λ and λ − 1. Show that if λ is not an integer, then there is a unique mode that is the largest integer smaller than λ.


10. In 1975, in Columbus, Ohio, there were 12 cases of childhood leukemia. The expected number is 6 per year (Morbidity and Mortality Weekly Report, July 25 1997, pp. 671-674). Assume that there are 200,000 children under 15 in that area and that each one has the same probability \(3 \times 10^{-5}\) of being hit by leukemia in a given year.


(a) Compute the probability of having 12 or more cases of leukemia in a given 
year. 

(b) Assume that there are 200 regions in the United States with the same 
number of children and the same probability for each child to be struck 
by leukemia. What is the probability that at least one region will get 12 
cases or more? 

(c) Considering (a) and (b), would you attribute the cluster in Columbus to 
chance alone? 


Chapter 7
Combinatorics


1 Counting Principle 


Before stating the fundamental principle of counting we give an example. 


Example 1 Assume that a restaurant offers 5 different specials and for each one of them you can pick either a salad or a soup. How many choices do you have?

In this simple example we can just enumerate all the possibilities. Number the specials from 1 to 5, and let S denote the salad and O denote the soup. There are 10 possibilities:


(1, S), (2, S), (3, S), (4, S), (5, S)
(1, O), (2, O), (3, O), (4, O), (5, O)


We have two selections to make, one with 2 choices and the other one with 5 choices. Thus, in all there are 2 × 5 = 10 choices.


• The multiplication rule. If we have r successive selections with \(n_k\) choices at the kth step, for k = 1, ..., r, then in all we have \(n_1 \times n_2 \times \cdots \times n_r\) possibilities.


Example 2 Consider an answer sheet with 5 categories for age, 2 categories for sex, and 3 categories for education. How many possible answer sheets are there?

In this example we have r = 3, \(n_1 = 5\), \(n_2 = 2\), and \(n_3 = 3\). Thus, in all, there are 5 × 2 × 3 = 30 possibilities.


Example 3 In a true/false test there are 10 questions. How many different ways can this test be answered?

This time we have 10 successive selections to be made, and for each selection we have two choices. In all there are \(2 \times 2 \times \cdots \times 2 = 2^{10}\) possibilities.
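The multiplication rule can be checked by brute-force enumeration for the three examples above. The sketch uses `itertools.product` from the standard library; the variable names are illustrative.

```python
from itertools import product

# Example 1: 5 specials, each with a salad (S) or a soup (O).
menu = list(product(range(1, 6), ["S", "O"]))
print(len(menu))           # 10

# Example 2: 5 age categories, 2 sex categories, 3 education categories.
answer_sheets = list(product(range(5), range(2), range(3)))
print(len(answer_sheets))  # 30

# Example 3: 10 true/false questions.
tests = list(product([True, False], repeat=10))
print(len(tests))          # 1024
```

Enumeration is only practical for small cases, which is why the multiplication rule is useful.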


Example 4 How many arrival orders are there for 3 runners? 
We call the three runners A, B, and C. A has three possible arrival positions. 
Once the arrival of A is fixed, then B has only two possible arrival positions. Once 


© Springer Nature Switzerland AG 2022
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_7 




the arrivals of A and B are fixed, there is only one possible arrival position for C. Thus, we may use the multiplication rule to get that in all there are 3 × 2 × 1 = 6 possibilities.

The preceding example illustrates a consequence of the multiplication rule that 
is particularly important. 


• A particular labeling of n distinct objects is called a permutation of these n objects. The number of possible permutations of n objects is n!.


Note that in Example 4 we are counting the number of permutations of 3 runners. The number of permutations is 3! = 6.


Example 5 How many ways are there to put 7 different books on a shelf? 

Again we need to count the number of permutations of 7 distinct objects. We get 
7!=5040 possibilities. 

Note that the factorials can be computed inductively by using the formula 


$$n! = n \times (n-1)!$$


Factorials grow very fast; see Stirling’s formula in the problems. 

We now turn to the task of counting the number of ordered subsets out of a set. 
For instance, consider the set {a, b, c}. How many ordered subsets of two distinct 
elements do we have? 

In this simple case we can enumerate all the ordered subsets with 2 distinct 
elements. We have (a, b), (a,c), (b, a), (b,c), (c, a), and (c, b). There are exactly 
6 ordered sets of 2 out of 3 elements. 

More generally: 


• The number of ways to pick an ordered set of k distinct elements out of n 
elements is 


$$n \times (n-1) \times (n-2) \times \cdots \times (n-k+1).$$


We now prove this formula. For the first object we pick we have n choices, for 
the second one we have n − 1 choices, for the third one n − 2 choices, and so on. 
For the k-th object we have (n − k + 1) choices. So according to the multiplication 
rule we have $n \times (n-1) \times (n-2) \times \cdots \times (n-k+1)$ ways to pick an ordered set 
of k objects. 
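The proof above can be mirrored in a short Python sketch, added here for illustration. The helper name `falling_product` is an assumption; `itertools.permutations` and `math.perm` are standard library functions used only as cross-checks.

```python
from itertools import permutations
from math import factorial, perm

# Ordered subsets of 2 distinct elements of {a, b, c}, listed explicitly.
ordered_pairs = list(permutations("abc", 2))
print(ordered_pairs)


def falling_product(n, k):
    """n x (n-1) x ... x (n-k+1), built one selection at a time."""
    result = 1
    for choices_left in range(n, n - k, -1):
        result *= choices_left
    return result


assert len(ordered_pairs) == falling_product(3, 2) == perm(3, 2)
# With k = n every object is used, which recovers the n! permutations.
assert falling_product(7, 7) == factorial(7)
```

The loop multiplies the number of choices left at each step, exactly as in the multiplication rule.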

In many situations we want to pick a set of (non-ordered) k objects among n 
objects where $k \le n$. How many ways are there to do that? Consider the set {a, b, c}. 
How many ways are there to pick an unordered subset of 2 distinct elements? There 
are exactly 3 such subsets, {a, b}, {a, c}, and {b, c}. More generally: 


• The number of ways to pick a non-ordered set of k distinct elements out of n is 
given by the binomial coefficient 


$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$



To prove this formula we will find a simple relation between the numbers of 
ordered and non-ordered subsets. Let c(k, n) be the number of ways to pick a non- 
ordered subset of k distinct objects out of n objects. We know that for every set of 
k objects there are k! ways to order it. Thus, the number of ways to pick an ordered 
set of k objects is 


$$k!\,c(k, n).$$


On the other hand, we also know this number to be $n \times (n-1) \times \cdots \times (n-k+1)$. 
Hence, 


$$k!\,c(k, n) = n \times (n-1) \times (n-2) \times \cdots \times (n-k+1).$$


Observe that 


$$n \times (n-1) \times (n-2) \times \cdots \times (n-k+1) = \frac{n!}{(n-k)!}.$$


Thus, 


$$c(k, n) = \frac{n!}{k!\,(n-k)!}.$$


That is, c(k, n) turns out to be the binomial coefficient $\binom{n}{k}$. 


Example 6 Three awards will be given to three distinct students in a group of 10 
students. How many ways are there to give these 3 awards? 

We want to know how many subsets of 3 students can be picked out of a set of 
10 students. This is exactly 


$$\binom{10}{3} = \frac{10!}{3!\,7!} = \frac{10 \times 9 \times 8}{3 \times 2} = 120.$$


Example 7 In a contest, 10 students will be ranked and the top 3 will get gold, 
silver, and bronze medals, respectively. How many ways are there to give these 3 
medals? 

This is different from Example 6 because the order of the 3 students picked is 
important. There are 10 possible choices for the gold medal, there are 9 choices 
for the silver medal, and there are 8 choices for the bronze. So according to the 
multiplication rule, there are 10 x 9 x 8 ways to give these medals. That is 720 
ways. Note that this is 6 (i.e., 3!) times more ways than in Example 6. 
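The factor 3! between Examples 6 and 7 can be checked directly. The following Python sketch is an added illustration (the variable names are assumptions); it enumerates the unordered and ordered selections of 3 students out of 10.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

students = range(10)

unordered = sum(1 for _ in combinations(students, 3))  # Example 6
ordered = sum(1 for _ in permutations(students, 3))    # Example 7

assert unordered == comb(10, 3)
assert ordered == perm(10, 3)
# Each unordered set of 3 students can be ordered in 3! ways.
assert ordered == factorial(3) * unordered
print(unordered, ordered)
```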


Example 8 In a business meeting 7 people shake hands. How many handshakes are 
there in all? 

There are as many handshakes as there are sets of 2 people among 7. So the 
number is 


$$\binom{7}{2} = \frac{7!}{2!\,5!} = 21.$$


Example 9 How many distinct strings of letters can be made out of the word 
CARE? 

Every permutation of these four distinct letters will give a distinct string of letters. 
Thus, there are 4! = 24 distinct strings of letters. 


Example 10 How many distinct strings of letters can be made out of the word 
PEPPER? 

There are 6! possible permutations of these 6 letters. However, there are only 3 
distinct letters in this word. For instance, if we permute the P’s only (there are 3! 
such permutations), we get the same string of letters. If we permute the E’s only 
(there are 2! such permutations), we also get the same string. Thus, there are 


$$\frac{6!}{2!\,3!} = 60$$


distinct strings of letters. 


Example 11 How many distinct strings can we make with three 1’s and two 0’s? 
This is very similar to Example 10. Note that there are 5! permutations, but since 
there are three 1’s and two 0’s, the total number of distinct strings is 


$$\frac{5!}{3!\,2!} = 10.$$
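Examples 9 to 11 all divide the number of permutations by the factorial of each repeated symbol's multiplicity. The Python sketch below, added for illustration with hypothetical helper names, compares that formula with a brute-force count of distinct strings.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def distinct_strings_formula(word):
    """n! divided by the factorial of each letter's multiplicity."""
    total = factorial(len(word))
    for multiplicity in Counter(word).values():
        total //= factorial(multiplicity)
    return total


def distinct_strings_brute_force(word):
    """Generate all permutations and keep only the distinct strings."""
    return len(set(permutations(word)))


for word in ("CARE", "PEPPER", "11100"):
    assert distinct_strings_formula(word) == distinct_strings_brute_force(word)
    print(word, distinct_strings_formula(word))
```

The brute-force count generates all n! permutations, so it is only meant as a check on short words.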


Problems 


1. Someone has three pairs of shoes, two pairs of pants, and four shirts. In how 
many ways can she get dressed? 

2. A test is composed of 12 questions. Each question can be true, false, or blank. 
How many ways are there to answer this test? 

3. In how many ways can 7 persons stand in line? 

4. Two balls are red and three are blue. How many ways are there to line up the balls? 

5. License plates have 3 letters and four numbers. How many different license 
plates can be made? 

6. In a class of 21 in how many ways can a professor give out 3 A’s? 

7. In a class of 21 in how many ways can a professor give out 3 A’s and 3 B’s? 




8. Assume that 8 horses are running and that three will win. 


(a) How many ways are there to pick the unordered three winners? 
(b) How many ways are there to pick the ordered three winners? 


9. According to Stirling’s formula, 
$$n! \sim \sqrt{2\pi}\, n^{n+1/2} e^{-n}.$$


That is, the ratio of the two sides tends to 1 as n goes to infinity. Use Stirling’s 
formula to approximate 10!, 20!, and 50!. How good are these approximations? 


2 Properties of the Binomial Coefficients 


• For every $n \ge 0$, $\binom{n}{0} = 1$. 


This can be shown easily by using the definition of a binomial coefficient and the 
fact that 0! = 1. 


• For all integers $n \ge 0$ and $0 \le k \le n$, 


$$\binom{n}{k} = \binom{n}{n-k}.$$

There are usually at least two ways to prove this type of identity. The first is 
to just use the computational definition of the binomial coefficient and check the 
identity. This works fine in this case. The second way is to use the fact that $\binom{n}{k}$ 
counts the number of subsets with k elements in a set of n elements. We now use 
the latter way. 

Each time we pick k out of n elements, we do not pick n − k out of n elements. 
So there are as many subsets with k elements (that is, $\binom{n}{k}$) as there are subsets with 
n − k elements (that is, $\binom{n}{n-k}$). This proves the identity. 

• For all integers $n \ge 1$ and $k \ge 1$, 

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}.$$

In order to prove this identity, fix a particular element out of the n elements and 
call it O. We have two possible types of subsets of k elements. The ones that contain 
O and the ones that do not contain O. There are $\binom{n-1}{k-1}$ subsets of k elements that 
contain O. This is so because if a subset has O, we still need to pick k − 1 elements 
out of n − 1. On the other hand there are $\binom{n-1}{k}$ subsets of k elements that do not 
contain O. By adding the two preceding binomial coefficients we get all the subsets 
of k elements out of n. This proves the identity. 


• Pascal’s triangle. The following triangle is an efficient way to compute binomial 
coefficients. 


1 
1 1 
1 2 1 
1 3 3 1 
1 4 6 4 1 
1 5 10 10 5 1 


The triangle is constructed thanks to the identity 


$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}.$$

That is, the entry (n, k) (nth row, kth column) is the sum of the entries (n − 1, k) 
and (n − 1, k − 1). 

We can read $\binom{n}{k}$ at the intersection of row n and column k. Note that the first 
row is n = 0 and the first column k = 0. For instance, $\binom{4}{2} = 6$ can be read at the 
intersection of n = 4 (i.e., the fifth row) and k = 2 (i.e., the third column). 
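The construction of the triangle uses only additions. A minimal Python sketch of it follows, added for illustration; the function name `pascal_triangle` is an assumption, and `math.comb` is used only to cross-check the entries.

```python
from math import comb


def pascal_triangle(num_rows):
    """Rows n = 0, ..., num_rows - 1, built only by additions."""
    rows = [[1]]
    for n in range(1, num_rows):
        previous = rows[-1]
        row = [1]  # entry (n, 0)
        for k in range(1, n):
            # entry (n, k) = entry (n-1, k-1) + entry (n-1, k)
            row.append(previous[k - 1] + previous[k])
        row.append(1)  # entry (n, n)
        rows.append(row)
    return rows


triangle = pascal_triangle(6)
for row in triangle:
    print(*row)

# Every entry agrees with the factorial definition of the binomial coefficient.
assert all(triangle[n][k] == comb(n, k) for n in range(6) for k in range(n + 1))
```

With rows and columns both numbered from 0, `triangle[4][2]` is the entry read at row n = 4 and column k = 2.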
We now turn to the Binomial Theorem. 


• Binomial Theorem. For any integer $n \ge 0$ and any real numbers a and b, 


$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}.$$

We will see why the theorem holds on a particular example. Take n = 4 and then 
$$(a+b)^4 = (a+b) \times (a+b) \times (a+b) \times (a+b).$$
All the terms in the expansion come from these four products. So all the terms must 
have degree 4. That is, all the terms are of the type $a^i b^j$ where i + j = 4. To get $a^4$ 
we must pick a in each one of the four terms in the product, and there is only one 
way to do that. In the final expansion there is only one $a^4$. To get $a^3 b$ we need to 
pick a’s from three of the four terms in the product, and there are $\binom{4}{3}$ ways to do 
that. In the final expansion there are 4 $a^3 b$. To get $a^2 b^2$ we need to pick 2 a’s, and 
there are $\binom{4}{2} = 6$ ways to do that. The terms $ab^3$ and $b^4$ are treated in a similar 
way. We end up with 
$$(a+b)^4 = a^4 + 4a^3 b + 6a^2 b^2 + 4ab^3 + b^4.$$


A nice application of the Binomial Theorem is the following: 
• Number of subsets. A set with n elements has $2^n$ subsets. 


For instance, consider the set {a, b}. It has the following subsets, ∅, {a}, {b}, and 
{a, b}. That is, it has $4 = 2^2$ subsets. 

We now prove the formula. Assume a set A has n elements. A subset of A can 
have 0 elements (the empty set), 1 element, 2 elements, and so on. For any $0 \le k \le n$, 
the number of subsets with k elements is $\binom{n}{k}$. Hence, the total number of subsets 
of A is 


$$\sum_{k=0}^{n} \binom{n}{k}.$$


By the Binomial Theorem the last expression is also $(1+1)^n$, that is, $2^n$. This proves 
the formula. 
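The argument can be checked on a small set. In the Python sketch below, added for illustration, the helper name `subsets_by_size` and the sample values of a and b are assumptions. It counts the subsets of each size, sums them, and also tests the Binomial Theorem for n = 4 at one point.

```python
from itertools import combinations
from math import comb


def subsets_by_size(elements):
    """Return a list whose k-th entry is the number of subsets with k elements."""
    items = list(elements)
    return [sum(1 for _ in combinations(items, k)) for k in range(len(items) + 1)]


sizes = subsets_by_size("abcde")
assert sizes == [comb(5, k) for k in range(6)]
assert sum(sizes) == 2 ** 5

# Binomial Theorem check for n = 4 at one arbitrary (hypothetical) point.
a, b = 3, -2
assert (a + b) ** 4 == sum(comb(4, k) * a**k * b ** (4 - k) for k in range(5))
```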


Problems 


10. Use Pascal’s triangle to compute $\binom{10}{k}$ for k = 0, ..., 10. 
11. Compute 

$$\sum_{k=0}^{n} \binom{n}{k} (-1)^k.$$


12. Expand (x + y)’. 
In the problems below, show the following identities. 


13. $\binom{n}{1} = n$. 
14. $\binom{n}{2} = \frac{n(n-1)}{2}$. 
15. 
$$\binom{n+\ell}{k} = \binom{n}{0}\binom{\ell}{k} + \binom{n}{1}\binom{\ell}{k-1} + \cdots + \binom{n}{k}\binom{\ell}{0}.$$
(Consider a class with n boys and $\ell$ girls. How many groups of k children can 
we have?) 
16. Use 15 to show 


= 
i 


17. 

$$\sum_{k=1}^{n} k \binom{n}{k} = n 2^{n-1}.$$


(Consider a department with n professors. How many committees headed by 
one chairman can we form?) 
18. (a) Show that for a natural n > 1 


(l+x)" Lies BES Die lees 


(b) Use (a) to guess the sequence (ax) in the formula 


lee) 
+x)? =14 ye apne 
k=1 


3 Hypergeometric Random Variables 


Consider an urn with b blue marbles and r red marbles. Assume that we take 
randomly n marbles from the urn without replacement. Let X be the number of 
blue marbles in the sample. What is the distribution of X? 


There are $\binom{b+r}{n}$ ways to select n marbles among b + r marbles. All these ways 
are equally likely. There are $\binom{b}{x}$ ways to select x blue marbles among b marbles. 
There are $\binom{r}{n-x}$ ways to select n − x red marbles among r red marbles. Hence, for 
$0 \le x \le b$, 

$$P(X = x) = \frac{\binom{b}{x}\binom{r}{n-x}}{\binom{b+r}{n}}.$$


• The random variable X is said to have a hypergeometric distribution with 
parameters b, r, and n. 




Rather than fitting the hypergeometric distribution to a specific problem, we use 
the counting principle to solve the problem. We give an example below. 
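For readers who want to experiment, the hypergeometric probabilities can be computed with a few lines of Python. This is a sketch added for illustration: the function name `hypergeometric_pmf` and the urn sizes below are assumptions, not taken from the text. Exact fractions are used so that the probabilities can be summed without rounding.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(x, b, r, n):
    """P(X = x) when n marbles are drawn without replacement from b blue and r red."""
    if x < 0 or x > n:
        return Fraction(0)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible values of x automatically get probability 0.
    return Fraction(comb(b, x) * comb(r, n - x), comb(b + r, n))


# Hypothetical urn: 5 blue marbles, 3 red marbles, sample of 4.
b, r, n = 5, 3, 4
distribution = {x: hypergeometric_pmf(x, b, r, n) for x in range(n + 1)}
for x, probability in distribution.items():
    print(x, probability)
```

In this urn it is impossible to draw 0 blue marbles, since that would require 4 red marbles out of 3; the formula returns 0 for that case.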


Example 12 Draw 4 cards without replacement from a deck of 52 cards. What is 
the probability of drawing exactly one ace? 
There are 4 aces in the deck so the number of ways to draw exactly one ace 
is $\binom{4}{1}$. The number of ways to draw 3 cards that are not aces is $\binom{48}{3}$. The total 
number of ways to draw 4 cards from a deck of 52 is $\binom{52}{4}$. Hence, the probability 
of drawing exactly one ace is 


$$\frac{\binom{4}{1}\binom{48}{3}}{\binom{52}{4}} \approx 0.26.$$


What is the probability of drawing exactly 2 aces? 
In a similar way as above we get 


$$\frac{\binom{4}{2}\binom{48}{2}}{\binom{52}{4}} \approx 0.025.$$


Example 13 You are dealt 5 cards from a 52-card deck. What is the probability 
of getting a full house (three of a kind and a pair of another kind)? There are $\binom{13}{1}$ 
ways to pick the kind for the triplet; once the kind is picked, there are $\binom{4}{3}$ ways to 
pick three cards of that kind. Then there are $\binom{12}{1}$ ways to pick the kind for the pair 
and then $\binom{4}{2}$ ways to pick two cards of that kind. By the multiplication principle 
there are 


$$\binom{13}{1}\binom{4}{3}\binom{12}{1}\binom{4}{2}$$


ways to pick a full house. Assuming that all $\binom{52}{5}$ hands are equally likely, we get 
that the probability of a full house is 


$$\frac{\binom{13}{1}\binom{4}{3}\binom{12}{1}\binom{4}{2}}{\binom{52}{5}} \approx 0.0014.$$




Example 14 You are dealt 5 cards from a 52-card deck. What is the probability 
of getting three of a kind? 

We want to compute the probability of exactly three of a kind. That is, we exclude 
four of a kind, a full house, and so on. 

There are $\binom{13}{1}$ ways to pick the kind for the triplet. Once the kind of the triplet 
is picked, there are $\binom{4}{3}$ ways to pick 3 cards to make a triplet. Then there are $\binom{12}{2}$ 
ways to pick the two remaining kinds. Once the kind of each remaining card has 
been picked, there are $\binom{4}{1}$ ways to pick one card for each kind. Thus, the number of 
ways to pick three of a kind is 

$$\binom{13}{1}\binom{4}{3}\binom{12}{2}\binom{4}{1}\binom{4}{1}.$$


By dividing the formula above by $\binom{52}{5}$ we get a probability of 0.02. 
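The counts in Examples 12 to 14 are products of binomial coefficients, so they are easy to evaluate numerically. The short Python sketch below is added for illustration (the variable names are assumptions) and simply evaluates each product.

```python
from math import comb

hands_4 = comb(52, 4)  # all 4-card draws
hands_5 = comb(52, 5)  # all 5-card hands

# Example 12: exactly one ace, then exactly two aces, in 4 cards.
p_one_ace = comb(4, 1) * comb(48, 3) / hands_4
p_two_aces = comb(4, 2) * comb(48, 2) / hands_4

# Example 13: kind of the triplet, its 3 cards, kind of the pair, its 2 cards.
full_houses = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)

# Example 14: triplet, then two other distinct kinds with one card each.
three_of_a_kind = comb(13, 1) * comb(4, 3) * comb(12, 2) * comb(4, 1) * comb(4, 1)

print(round(p_one_ace, 3), round(p_two_aces, 3))
print(full_houses, round(full_houses / hands_5, 4))
print(three_of_a_kind, round(three_of_a_kind / hands_5, 2))
```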


Problems 


19. Draw five cards without replacement from a deck of 52 cards. Let X be the 
number of aces among the five cards. Find the probability distribution of X. 
20. There are 4 pairs of shoes in a box. 


(a) I pick two shoes at random, what is the probability that I get a pair? 

(b) I pick three shoes at random, what is the probability that I get a pair? 

(c) I pick four shoes at random, what is the probability of getting exactly one 
pair? 

(d) I pick four shoes at random, what is the probability of getting two pairs? 


21. I get 13 cards from a deck of 52 cards. 


(a) What is the probability that I get all 4 aces? 
(b) What is the probability that I get no ace? 
(c) What is the probability that I get all 13 hearts? 


22. In a lottery you pick 5 different numbers between 1 and 50. Then 5 different 
numbers are drawn between 1 and 50. 


(a) What is the probability of winning this lottery? 

(b) What is the probability that none of your numbers are drawn? 

(c) What is the probability that exactly 4 of your numbers are drawn? 
(d) What is the probability that exactly 3 of your numbers are drawn? 


23. You are dealt 5 cards from a 52-card deck. What is the probability that: 


(a) You get exactly one pair? 

(b) You get two pairs? 

(c) You get a straight flush (5 consecutive cards of the same suit)? 
(d) A flush (5 of the same suit but not a straight flush)? 

(e) A straight (5 consecutive cards but not a straight flush)? 


24. Show that 

$$\sum_{x=0}^{n} \frac{\binom{b}{x}\binom{r}{n-x}}{\binom{b+r}{n}} = 1.$$

This shows that the hypergeometric distribution is indeed a probability distribution. 
(Use one of the appropriate binomial identities from the previous section.) 

25. Consider an urn with r red balls and b blue balls. Draw n balls with 
replacement. Let X be the number of red balls among the n balls drawn. What 
is the distribution of X? Specify the parameters. 


4 Mean and Variance of a Hypergeometric 


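Let X be a hypergeometric random variable. The following identity is the main 
ingredient in the computations of the mean and variance of X. We will show that for 
$1 \le x \le n$, 


$$x\binom{n}{x} = n\binom{n-1}{x-1}. \qquad (1)$$


We now prove (1). By the definition of the binomial coefficient, 


$$x\binom{n}{x} = \frac{x\,n!}{x!\,(n-x)!} = \frac{n\,(n-1)!}{(x-1)!\,(n-x)!} = n\binom{n-1}{x-1}.$$


By definition of the expectation, 


$$E(X) = \sum_{x=1}^{n} x\,P(X = x) = \sum_{x=1}^{n} x\,\frac{\binom{b}{x}\binom{r}{n-x}}{\binom{b+r}{n}}.$$


We now use (1) to get 


$$x\binom{b}{x} = b\binom{b-1}{x-1}$$


and 


$$\binom{b+r}{n} = \frac{b+r}{n}\binom{b+r-1}{n-1}.$$

Plugging these two expressions back in E(X), we get 


$$E(X) = \frac{nb}{b+r}\sum_{x=1}^{n} \frac{\binom{b-1}{x-1}\binom{r}{n-x}}{\binom{b+r-1}{n-1}}.$$


Note that the last sum is the sum of all the probabilities of a hypergeometric with 
parameters b − 1, r, and n − 1. This sum is therefore 1. Hence, 


$$E(X) = \frac{nb}{b+r}.$$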


We now turn to the computation of the variance. It is easier to compute $E(X(X-1))$ 
than to compute $E(X^2)$. This is our starting point. 


By using (1) twice for $x \ge 2$, 


$$x(x-1)\binom{b}{x} = (x-1)\,x\binom{b}{x} = (x-1)\,b\binom{b-1}{x-1} = b(b-1)\binom{b-2}{x-2}.$$


Similarly, we use (1) twice to get 


$$\binom{b+r}{n} = \frac{(b+r)(b+r-1)}{n(n-1)}\binom{b+r-2}{n-2}.$$


Therefore, 


$$E(X(X-1)) = \sum_{x=2}^{n} x(x-1)\,P(X = x) = \sum_{x=2}^{n} x(x-1)\,\frac{\binom{b}{x}\binom{r}{n-x}}{\binom{b+r}{n}}$$

$$= \sum_{x=2}^{n} \frac{b(b-1)\,n(n-1)}{(b+r)(b+r-1)}\,\frac{\binom{b-2}{x-2}\binom{r}{n-x}}{\binom{b+r-2}{n-2}} = \frac{b(b-1)\,n(n-1)}{(b+r)(b+r-1)} \sum_{x=2}^{n} \frac{\binom{b-2}{x-2}\binom{r}{n-x}}{\binom{b+r-2}{n-2}}.$$


The last sum is again the sum of the probabilities of a hypergeometric distribution. 
The parameters are b − 2, r, and n − 2. The sum is therefore 1. Thus, 


$$E(X(X-1)) = \frac{b(b-1)\,n(n-1)}{(b+r)(b+r-1)}.$$


To simplify the notation, let N = b + r. Hence, 


$$E(X) = \frac{nb}{N} \quad \text{and} \quad E(X(X-1)) = \frac{b(b-1)\,n(n-1)}{N(N-1)}.$$


Note that 


$$E(X^2) = E(X(X-1)) + E(X).$$


Since $Var(X) = E(X^2) - E(X)^2$, 


$$Var(X) = \frac{b(b-1)\,n(n-1)}{N(N-1)} + \frac{nb}{N} - \frac{n^2 b^2}{N^2}$$

$$= \frac{nb}{N^2(N-1)}\Big(N(b-1)(n-1) + N(N-1) - bn(N-1)\Big)$$

$$= \frac{nb\,(N-b)(N-n)}{N^2(N-1)} = npq\,\frac{N-n}{N-1},$$


where p = b/N and q = (N − b)/N. 

If the sampling is done with replacement, then the number of blue marbles B is 
a binomial with parameters n and p. Hence, E(B) = np and Var(B) = npq. We 
note that E(B) = E(X), but Var(B) > Var(X). 
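The formulas for the mean and the variance can be checked against a direct computation from the probabilities P(X = x). The Python sketch below is added for illustration; the function name `hypergeometric_moments` and the urn sizes are assumptions. Exact fractions avoid any rounding issue in the comparison.

```python
from fractions import Fraction
from math import comb


def hypergeometric_moments(b, r, n):
    """Mean and variance computed directly from the probabilities P(X = x)."""
    total = comb(b + r, n)
    pmf = {x: Fraction(comb(b, x) * comb(r, n - x), total) for x in range(n + 1)}
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return mean, second_moment - mean**2


b, r, n = 6, 4, 3  # hypothetical urn and sample size
N = b + r
p, q = Fraction(b, N), Fraction(N - b, N)

mean, variance = hypergeometric_moments(b, r, n)
assert mean == n * p
assert variance == n * p * q * Fraction(N - n, N - 1)
assert variance < n * p * q  # smaller than the binomial variance since n > 1
print(mean, variance)
```

The factor (N − n)/(N − 1) is what separates sampling without replacement from sampling with replacement; it equals 1 only when n = 1.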


5 Conditioning on the Number of Successes 


Let $X_1, X_2, \ldots$ be a sequence of independent Bernoulli random variables with 
probability p. That is, $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. Note that 
this can be written in a condensed form, 


$$P(X_i = x) = p^x (1-p)^{1-x},$$


where x = 0 or 1. We will use this form in the computation below. 

We can think of $X_1, X_2, \ldots$ as a sequence of bets. If $X_i = 1$, we have won the 
i-th bet, and if $X_i = 0$ we lost the bet. Let $B_n$ be the total number of bets we won 
up to the n-th bet. 

Let $x_1, x_2, \ldots, x_n$ be a sequence for which each $x_i$ is 0 or 1. Let 


$$k = \sum_{i=1}^{n} x_i.$$

Then, 

$$P(X_1 = x_1, \ldots, X_n = x_n \mid B_n = k) = \frac{1}{\binom{n}{k}}. \qquad (2)$$

There are at least two remarkable aspects of this result. 


• The conditional probability does not depend on p. 

• Given the number of successes up to time n, the position of these successes is 
completely random. This is so because there are $\binom{n}{k}$ ways to place k 1’s in n 
spots. Formula (2) tells us that all ways are equally likely. 
We now prove formula (2). By definition of conditional probability, 


$$P(X_1 = x_1, \ldots, X_n = x_n \mid B_n = k) = \frac{P(\{X_1 = x_1, \ldots, X_n = x_n\} \cap \{B_n = k\})}{P(B_n = k)}.$$


Since $k = \sum_{i=1}^{n} x_i$, the event $\{X_1 = x_1, \ldots, X_n = x_n\}$ is included in the event 
$\{B_n = k\}$. Hence, the intersection of these two events is the smaller event. Since the 
$X_i$ are independent, 


$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n)$$
$$= p^{x_1}(1-p)^{1-x_1} \cdots p^{x_n}(1-p)^{1-x_n}$$
$$= p^k (1-p)^{n-k}.$$

Using that $B_n$ has a binomial distribution with parameters n and p, 


$$P(X_1 = x_1, \ldots, X_n = x_n \mid B_n = k) = \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(B_n = k)} = \frac{p^k (1-p)^{n-k}}{\binom{n}{k} p^k (1-p)^{n-k}} = \frac{1}{\binom{n}{k}}.$$


This proves formula (2). 
We now use formula (2) to prove 


$$P(B_n = k \mid B_{n+r} = \ell) = \frac{\binom{n}{k}\binom{r}{\ell-k}}{\binom{n+r}{\ell}}. \qquad (3)$$

This is so because there are $\binom{n+r}{\ell}$ ways to place $\ell$ 1’s in n + r spots. According 
to (2), all these configurations are equally likely. We count the configurations with 
k 1’s in the spots 1 through n and $\ell - k$ 1’s in the spots n + 1 through n + r. There 
are 


$$\binom{n}{k}\binom{r}{\ell-k}$$


such configurations. This proves (3). 
Formula (3) tells us that the conditional distribution of $B_n$ given $B_{n+r} = \ell$ is a 
hypergeometric. 
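Formula (3), and the fact that it does not depend on p, can be checked by listing all sequences of 0’s and 1’s. The Python sketch below is added for illustration; the function names and the values of n, r, $\ell$, and p are assumptions. Exact fractions are used so that equality can be tested exactly.

```python
from fractions import Fraction
from itertools import product
from math import comb


def sequence_probability(bits, p):
    """P(X_1 = x_1, ..., X_m = x_m) for independent Bernoulli(p) bets."""
    wins = sum(bits)
    return p**wins * (1 - p) ** (len(bits) - wins)


def conditional_given_total(n, r, ell, p):
    """Distribution of B_n given B_{n+r} = ell, by listing all 0/1 sequences."""
    weights = {}
    for bits in product((0, 1), repeat=n + r):
        if sum(bits) == ell:
            k = sum(bits[:n])  # number of wins among the first n bets
            weights[k] = weights.get(k, 0) + sequence_probability(bits, p)
    total = sum(weights.values())  # this is P(B_{n+r} = ell)
    return {k: w / total for k, w in weights.items()}


n, r, ell = 4, 3, 3  # hypothetical values
for p in (Fraction(1, 3), Fraction(3, 4)):
    conditional = conditional_given_total(n, r, ell, p)
    for k, value in conditional.items():
        # Formula (3): the same hypergeometric probabilities for every p.
        assert value == Fraction(comb(n, k) * comb(r, ell - k), comb(n + r, ell))
```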


Chapter 8 
Continuous Random Variables 


1 Probability Densities 


• A continuous random variable takes all values in a given interval of real 
numbers. 


Example 1 The service time of a television has values in the interval (0, +∞). That 
is, the television can break at any moment after I bought it (time 0), and there is no 
set upper limit until when it can last. 

A probability density is a function f defined on the real numbers with the 
following properties: 


• There is an interval (a, b) called the support of f such that f(x) > 0 if x is in 
(a, b) and f(x) = 0 otherwise. 


• $\int_a^b f(x)\,dx = 1$. 


All the probability densities we will consider will be continuous except possibly 
at one or two points. Hence, these functions will always be Riemann integrable. The 
support of the probability density (a, b) may have $a = -\infty$ or $b = +\infty$. 

We will restrict our definition of continuous random variables to those that have 
a probability density. 


• Let X be a continuous random variable. Then, X has a probability density f such 
that for any c < d, 


$$P(c \le X \le d) = \int_c^d f(x)\,dx.$$


© Springer Nature Switzerland AG 2022 89 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_8 




Note that $P(X = c) = \int_c^c f(x)\,dx = 0$ for any real number c. Therefore, 
continuous random variables have a probability 0 of hitting a specific point. In 
particular, 


$$P(c \le X \le d) = P(c < X \le d) = P(c < X < d).$$


Example 2 Let X have density $f(x) = cx^2$ for x in (−1, 1) and f(x) = 0 elsewhere. 
Find c. 
We must have 


$$\int_{-1}^{1} cx^2\,dx = 1.$$


After integrating we get 
$$c\,(2/3) = 1$$


and therefore c = 3/2. 
What is the probability that X is larger than 1/2? Observe that the support of X 
is (−1, 1); hence, X is always less than 1. Therefore, 


$$P(X > 1/2) = \int_{1/2}^{1} f(x)\,dx = \int_{1/2}^{1} (3/2)x^2\,dx = 7/16.$$


What is the probability that X is negative? Since X > −1, 


$$P(X < 0) = \int_{-1}^{0} f(x)\,dx = \int_{-1}^{0} (3/2)x^2\,dx = \frac{1}{2}.$$
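The three integrals of Example 2 can be approximated numerically. The Python sketch below is added for illustration; the midpoint-rule helper `integrate` and its step count are assumptions chosen for simplicity, not a method used in the text.

```python
def integrate(f, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of f over (lower, upper)."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


# Step 1: find c so that c * x^2 integrates to 1 over the support (-1, 1).
c = 1 / integrate(lambda x: x**2, -1, 1)


def density(x):
    """f(x) = c x^2 on the support (-1, 1) and 0 elsewhere."""
    return c * x**2 if -1 < x < 1 else 0.0


# Step 2: probabilities are integrals of the density.
p_larger_than_half = integrate(density, 0.5, 1)
p_negative = integrate(density, -1, 0)
print(round(c, 4), round(p_larger_than_half, 4), round(p_negative, 4))
```

The results agree with c = 3/2, 7/16, and 1/2 up to the small error of the numerical approximation.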


2 Uniform Random Variables 


We first define a special case that is particularly important in applications. 


• Let X be a random variable with density f(x) = 1 for x in (0, 1) and f(x) = 0 
elsewhere. The random variable X is said to be uniform on (0, 1). 


Next we graph the density of X (Fig. 8.1). 
Note that $\int_0^1 f(x)\,dx = \int_0^1 dx = 1$ and that the support of X is (0, 1). 


Example 3 What is the probability for a uniform random variable on (0, 1) to be in 
the interval (1/2, 3/4)? 


We have that $P(1/2 < X < 3/4) = \int_{1/2}^{3/4} f(x)\,dx = 1/4$. 


Fig. 8.1 The density of a uniform random variable on (0, 1) 


What is the probability that X is larger than 1/2? Since X < 1 (why?), 


$$P(X > 1/2) = \int_{1/2}^{1} f(x)\,dx = 1/2.$$


More generally, we have the following: 


• A continuous random variable X is said to be uniform on the interval (a, b) if 
the density of X is 


$$f(x) = \frac{1}{b-a} \quad \text{for } x \in (a, b).$$

We defined f only on its support (a, b). Elsewhere f is understood to be 0. This 
will be our general convention. 

Note that the density of a uniform random variable is flat on its support. That 
is why it is called uniform. A uniform random variable is a good model when we 
know the support of the distribution, but there is no reason to assume that one part 
of the support is more likely than another. We give an example next. 


Example 4 I know that a shuttle to campus runs every 20 min, but I do not have the 
exact schedule. I show up at a random time at the shuttle stop. How long will I have 
to wait? 

Clearly, I will wait less than 20 min. There is no reason to believe any subinterval 
of (0, 20) is more likely than any other subinterval. Hence, to model my waiting time 
I pick a uniform random variable on (0, 20). I denote the uniform random variable 
by U. What is the probability that I will wait less than 5 min? 


$$P(U < 5) = \frac{1}{20}\int_0^5 du = \frac{1}{4}.$$


3 Exponential Random Variables 


• Let a > 0. A random variable X with probability density $f(x) = ae^{-ax}$ for 
$x \ge 0$ is said to be an exponential random variable with parameter (or rate) a. 


Next we sketch the graph of an exponential density (Fig. 8.2). 
We now check that this is indeed a probability density. Note that f(x) > 0 on its 
support (0, +∞). Recall that an antiderivative of $ae^{-ax}$ is $-e^{-ax}$. Hence, 


$$\int_0^{+\infty} f(x)\,dx = -e^{-ax}\Big]_0^{+\infty} = 1.$$


Thus, f is indeed a probability density. 
The exponential distribution models well random phenomena with no wear and 
tear. Next we give an example. 


Example 5 How long will it take for my new windshield to get cracked? 

A windshield does not age. Even if my windshield has not cracked in the past 5 
years, it is no more likely to crack in the next month than it was the first month I 
got it 5 years ago. Therefore, we pick for the time it takes to get the first crack an 
exponential random variable with density $f(x) = ae^{-ax}$. How to pick a? We will 
see that the expected value of an exponential distribution is 1/a. In my experience I 
get a crack on average every 6 months. Therefore, measuring time in months, I pick a = 1/6. 

What is the probability that the first crack will not happen for at least one year? 

Let X be the time for the first crack. Then, 


$$P(X > 12) = \int_{12}^{+\infty} \frac{1}{6} e^{-x/6}\,dx = -e^{-x/6}\Big]_{12}^{+\infty} = e^{-2} \approx 0.14.$$


What is the probability that the first crack will happen between 6 months and a 
year? 


$$P(6 < X < 12) = \int_6^{12} \frac{1}{6} e^{-x/6}\,dx = -e^{-x/6}\Big]_6^{12} = e^{-1} - e^{-2} \approx 0.23.$$


Fig. 8.2 The density of an exponential random variable, $f(x) = ae^{-ax}$ 


3.1 Memoryless Property 


• Let T be an exponential random variable with parameter a. The random variable 
T has the following memoryless property. For any $s \ge 0$ and $t \ge 0$, 


$$P(T > t + s \mid T > s) = P(T > t).$$


In words, assume T is the time until a certain event happens (the windshield 
cracks or the computer dies, for instance). Then knowing that the event has not 
happened by time s, the conditional probability that the event will not happen for 
another t units of time is the same as the unconditional probability that the event 
will not happen t units after the initial time. 


We now prove the formula. First, note that for any $s \ge 0$ 
$$P(T > s) = \int_s^{\infty} ae^{-at}\,dt = e^{-as}.$$


By definition of the conditional probability, 


$$P(T > t + s \mid T > s) = \frac{P(\{T > t + s\} \cap \{T > s\})}{P(T > s)}.$$


Note if T > t + s, then T > s. Hence, the intersection of the events {T > t + s} 
and {T > s} is the event {T > t + s}. Thus, 


$$P(T > t + s \mid T > s) = \frac{P(T > t + s)}{P(T > s)}.$$


Since $P(T > t + s) = e^{-a(t+s)}$ and $P(T > s) = e^{-as}$, 


$$P(T > t + s \mid T > s) = \frac{e^{-a(t+s)}}{e^{-as}} = e^{-at}.$$


Since $e^{-at} = P(T > t)$, 
$$P(T > t + s \mid T > s) = P(T > t).$$

This completes the proof of the memoryless property for the exponential distribu- 
tion. 
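The windshield computations and the memoryless property can be checked numerically from the formula $P(T > t) = e^{-at}$. The Python sketch below is added for illustration; the function name `survival` and the sample values of s and t are assumptions.

```python
from math import exp, isclose


def survival(t, rate):
    """P(T > t) = e^{-rate * t} for an exponential random variable, t >= 0."""
    return exp(-rate * t)


rate = 1 / 6  # windshield example: one crack every 6 months on average

p_no_crack_first_year = survival(12, rate)                       # P(X > 12)
p_crack_in_second_half = survival(6, rate) - survival(12, rate)  # P(6 < X < 12)
print(round(p_no_crack_first_year, 2), round(p_crack_in_second_half, 2))

# Memoryless property: P(T > t + s | T > s) = P(T > t + s) / P(T > s) = P(T > t).
for s, t in [(1, 1), (60, 1), (24, 12)]:
    conditional = survival(t + s, rate) / survival(s, rate)
    assert isclose(conditional, survival(t, rate))
```

The pair (s, t) = (60, 1) corresponds to the windshield that has lasted 5 years: the chance of surviving one more month is the same as for a new one.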


Problems 


1. Let f(x) = cx(1 — x) for x in [0,1] and f(x) = 0 elsewhere. Find c so that f 
is a probability density function. 

2. Let the graph of the density f be an isosceles triangle for x in (—1, 1). Find f. 

3. Let X be a uniform random variable on (-2, 4). Find the density 
of X. 

4. Let T be the waiting time for a bus. Assume that T has an exponential density 
with rate 3 per hour. 


(a) What is the probability of waiting at least 20 min for the bus? 

(b) Given that we have waited 20 min, what is the probability of waiting an 
additional 20 min for the bus? 

(c) Under which conditions is the exponential model appropriate for this 
problem? 


5. Let T be a waiting time for a bus. Assume that T has a uniform distribution on 
[0,40]. 


(a) What is the probability of waiting at least 20 min for the bus? 
(b) Given that we have waited 20 min, what is the probability of waiting an 
additional 10 min for the bus? 


6. Let Y have a probability density g(y) = cye7” for y > 0. Find c. 

7. Let X have density f(x) = xe~* for x > 0. What is the probability that X is 
larger than 3? 

8. Let T have density $g(t) = 4t^3$ for t in (0, 1). 


(a) What is the probability that T is between 1/4 and 3/4? 
(b) What is the probability that T is larger than 1/2? 


9. Let 0 < a < b. 


(a) Show that for any random variable X, 
$P(a < X \le b) = P(X \le b) - P(X \le a)$. 
(b) Assume that the random variable X is continuous and that P(X > s) = 


e~ 2. Use (a) to compute P(a < X <b). 
(c) Find the density of X. 


4 Expected Value 


For many problems it is enough to have a rough idea of the distribution and one tries 
to summarize the distribution by using a few numbers. The most important of these 
numbers is the expectation. 




• Assume that X is a continuous random variable with probability density f and 
support (a, b). The expected value (or mean) of X is defined by 


$$E(X) = \int_a^b x f(x)\,dx.$$


Note that if the support is infinite ($a = -\infty$ or $b = +\infty$), the integral defining 
E(X) may fail to converge. In such a case the random variable X has no expected 
value. 


Example 6 Let X have probability density f(x) = 2x for x in (0, 1). What is the 
expected value of X? 


$$E(X) = \int_0^1 x f(x)\,dx = 2\int_0^1 x^2\,dx = \frac{2}{3}.$$


The random variable X can be anywhere in (0, 1). The density however is tilted 
toward 1. An expected value of 2/3 gives an idea of where the random variable 
typically is. The expected value can be a good measure of location or not. This will 
depend on how dispersed the distribution is. We will define a measure of dispersion 
in the next section. 


• Let X be uniformly distributed on (a, b). Then the probability density of X is 
$f(x) = \frac{1}{b-a}$ for x in (a, b); thus 


$$E(X) = \int_a^b x f(x)\,dx = \frac{1}{b-a}\int_a^b x\,dx = \frac{1}{b-a}\cdot\frac{b^2 - a^2}{2} = \frac{a+b}{2}.$$


That is, the expected value of a uniform on (a, b) is the midpoint between a and 
b. 


• Let T be exponentially distributed with rate a. Then the probability density is 
$f(t) = ae^{-at}$ for $t \ge 0$. Hence, 


$$E(T) = \int_0^{\infty} t f(t)\,dt = \int_0^{\infty} t\,ae^{-at}\,dt.$$
Integration by parts yields 
$$E(T) = -te^{-at}\Big]_0^{\infty} + \int_0^{\infty} e^{-at}\,dt = \frac{1}{a}.$$


That is, the expected value of an exponential random variable is the inverse of its 
rate. 
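The three expected values above can be approximated by numerical integration. The Python sketch below is added for illustration; the midpoint-rule helper, the endpoints of the uniform, the rate, and the truncation point of the infinite support are all assumptions.

```python
from math import exp


def integrate(f, lower, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of f over (lower, upper)."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


def expected_value(density, lower, upper):
    """E(X) = integral of x f(x) over the support (lower, upper)."""
    return integrate(lambda x: x * density(x), lower, upper)


# Example 6: f(x) = 2x on (0, 1).
mean_tilted = expected_value(lambda x: 2 * x, 0, 1)

# Uniform on (a, b); the endpoints are hypothetical.
a, b = -2.0, 4.0
mean_uniform = expected_value(lambda x: 1 / (b - a), a, b)

# Exponential with rate 0.5; the infinite support is truncated at 100,
# where the neglected tail is of order 100 * e^{-50}.
rate = 0.5
mean_exponential = expected_value(lambda t: rate * exp(-rate * t), 0, 100)

print(round(mean_tilted, 4), round(mean_uniform, 4), round(mean_exponential, 4))
```

The approximations agree with 2/3, the midpoint (a + b)/2 = 1, and the inverse of the rate, 1/0.5 = 2.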


4.1 Symmetric Probability Density 


If the probability density is symmetric, then the expected value is given by the point 
of symmetry. 


• Let f be the density of a continuous random variable X. Assume that there is 
$s \ge 0$ such that 


$$f(s + x) = f(s - x)$$


for every x in the support of f. Then, E(X) = s (if E(X) exists!). 


Example 7 Let X have probability density $f(x) = \frac{2}{\pi}\,\frac{1}{1+x^2}$ with support (−1, 1). 
What is the expected value of X? 

Since f(x) = f(−x) for every x in the support of X, the probability density is 
symmetric with respect to 0. Hence, E(X) = 0. 


4.2 Function of a Random Variable 


As we will see in the next section, it is often necessary to compute $E(X^2)$. This 
is NOT $E(X)^2$. In order to use our definition of expected value we should first 
compute the probability density of $X^2$ to get the expected value. Thankfully, there 
is a quicker way to do things, and it is contained in the following formula: 


• Let X be a continuous random variable with density f and support (a, b). Let g 
be a real valued function. For instance, $g(x) = x^2$. Then, 


$$E(g(X)) = \int_a^b g(x) f(x)\,dx.$$


a 


Example 8 Let X be uniformly distributed on [0,1]. What is $E(X^2)$? 
Let $g(x) = x^2$. We want E(g(X)). Hence, 


$$E(g(X)) = \int_0^1 g(x) f(x)\,dx = \int_0^1 x^2\,dx = \frac{1}{3}.$$


Example 9 Let X be uniformly distributed on [0,1]. What is $E(e^{-X})$? 
Let $g(x) = e^{-x}$. We have 


$$E(e^{-X}) = \int_0^1 e^{-x} f(x)\,dx = \int_0^1 e^{-x}\,dx = 1 - e^{-1}.$$


5 The Median 


To summarize the location of a distribution it is often a good idea to use more than 
one number. Besides the mean another important number is the median. 


• The median m of a continuous random variable X is defined by 
$$P(X < m) = 1/2.$$


Note that we also have P(X > m) = 1/2. A median gives less weight to the 
extreme values of the distribution than the expected value. Another advantage of the 
median over the expected value is that while the expected value may not exist, the 
median always does. 


Example 10 Let X have probability density f(x) = 2x for x in (0, 1). What is the 
median of X? 
Let m be the median of X. By definition, 


$$\int_0^m f(x)\,dx = \frac{1}{2}.$$


We integrate the l.h.s. and get the equation $m^2 = 1/2$. Since m > 0 (why?), we get 


$$m = \frac{\sqrt{2}}{2}.$$


Example 11 Let T be an exponential random variable with rate 1. What is its 
median? 
We solve the equation 


$$P(T > m) = \int_m^{\infty} e^{-t}\,dt = e^{-m} = 1/2.$$


Thus m = ln 2. Note that P(T < ln 2) = 1 − P(T > ln 2) = 1/2. So ln 2 is the 
median of this distribution. Note that the median is different from the mean that is 1 
in this case. 
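When the equation P(X ≤ m) = 1/2 cannot be solved by hand, it can be solved numerically. The Python sketch below is added for illustration: bisection is one possible method, not one used in the text, and the function name `median` and the bracketing intervals are assumptions. It is applied to Examples 10 and 11, where the answer is known.

```python
from math import exp, log, sqrt


def median(cdf, lower, upper, tolerance=1e-12):
    """Solve cdf(m) = 1/2 by bisection; cdf must be increasing on (lower, upper)."""
    while upper - lower > tolerance:
        middle = (lower + upper) / 2
        if cdf(middle) < 0.5:
            lower = middle  # the median is to the right of middle
        else:
            upper = middle  # the median is at or to the left of middle
    return (lower + upper) / 2


# Example 10: f(x) = 2x on (0, 1), so P(X <= m) = m^2.
median_tilted = median(lambda m: m**2, 0, 1)

# Example 11: exponential with rate 1, so P(T <= m) = 1 - e^{-m}.
# The upper bracket 50 is an arbitrary (hypothetical) bound beyond the median.
median_exponential = median(lambda m: 1 - exp(-m), 0, 50)

print(median_tilted, sqrt(2) / 2)
print(median_exponential, log(2))
```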


Problems 


10. Let X be exponentially distributed with mean 1/2. 


(a) What is the density of X? 
(b) What is E(X)? 


11. Let U be a random variable that is uniformly distributed on (−1, 2). 


(a) Compute the mean of U. 
(b) What is the median of U? 


12. Let X have the density in Fig. 8.3. 


(a) What is a? 
(b) Find the expected value of X. 
(c) How good is E(X) as a measure of location of X? 


Fig. 8.3 Probability density for Problem 12 


13. Let $f(x) = 3x^2$ for x in [0,1]. Let X be a random variable with density f. 


(a) What is E(X)? 
(b) What is the median of X? 


14. Let T be exponentially distributed with rate a. Find the median of T as a function 
of a. 

15. Let X be exponentially distributed with rate 1. What is $E(X^2)$? 

16. This problem gives an example of a continuous random variable that has no 
expectation. 


(a) Show that $f(x) = \frac{1}{x^2}$ for x > 1 is a density function. 
(b) Show that a random variable with the density above has no expectation. 
(c) Find the median of this random variable. 


17. Consider a uniform random variable on (a, b). Use a symmetry argument to 
find the expected value without any computation. 


6 Variance 


The expectation and the median are measures of location. These numbers can be
thought of as “typical” outcomes of the corresponding probability distribution. But
how good are these numbers at summarizing the distribution? Next we define the
variance, which is a measure of how good the expectation is. A small variance means
low dispersion of the distribution around the expected value. A large variance means
high dispersion around the expected value. In the latter case the expected value is
not very informative about the distribution.


• Let X be a continuous random variable with mean E(X) = μ. The variance of
X is denoted by Var(X) and is defined by

$$\mathrm{Var}(X) = E[(X - \mu)^2].$$

The following formula is useful for computational purposes

$$\mathrm{Var}(X) = E(X^2) - \mu^2.$$

The standard deviation of X is denoted by SD(X) and is defined by

$$SD(X) = \sqrt{\mathrm{Var}(X)}.$$


Here are two important consequences of these definitions: 


• The variance of a random variable is ALWAYS positive or 0. This is so because
the variance is the expected value of the nonnegative random variable $(X - \mu)^2$.
• The variance of a random variable X is 0 if and only if X is a constant.


We now turn to the computation of variances. 


Example 12 Let X have probability density $f(x) = 4x^3$ for x in (0, 1). What is the
variance of X?


We need E(X) and $E(X^2)$. We have

$$E(X) = \int_0^1 x f(x)\,dx = 4\int_0^1 x^4\,dx = \frac{4}{5}.$$

We now turn to

$$E(X^2) = \int_0^1 x^2 f(x)\,dx = 4\int_0^1 x^5\,dx = \frac{2}{3}.$$

Hence,

$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = \frac{2}{3} - \left(\frac{4}{5}\right)^2 = \frac{2}{75}.$$


• Let X be uniformly distributed on (a, b). Then,

$$\mathrm{Var}(X) = \frac{(b-a)^2}{12}.$$

We first compute $E(X^2)$.

$$E(X^2) = \int_a^b x^2 f(x)\,dx = \frac{1}{b-a}\int_a^b x^2\,dx = \frac{1}{3(b-a)}(b^3 - a^3).$$

Since $b^3 - a^3 = (b - a)(b^2 + ab + a^2)$,

$$E(X^2) = \frac{b^2 + ab + a^2}{3}.$$

Using that E(X) = (a + b)/2,

$$\begin{aligned}
\mathrm{Var}(X) &= E(X^2) - E(X)^2 \\
&= \frac{b^2 + ab + a^2}{3} - \frac{(a+b)^2}{4} \\
&= \frac{b^2 - 2ab + a^2}{12} \\
&= \frac{(b-a)^2}{12}.
\end{aligned}$$


• Let T be exponentially distributed with rate a > 0. Then $\mathrm{Var}(T) = 1/a^2$ and
therefore SD(T) = 1/a.


Note that E(T) = SD(T); this shows that the exponential distribution is rather
dispersed.
We now compute this variance.

$$E(T^2) = \int_0^{+\infty} t^2 f(t)\,dt = \int_0^{+\infty} t^2 a e^{-at}\,dt.$$

We do an integration by parts to get

$$E(T^2) = \left[-t^2 e^{-at}\right]_0^{+\infty} + \int_0^{+\infty} 2t e^{-at}\,dt = 2\int_0^{+\infty} t e^{-at}\,dt.$$

Recall that E(T) = 1/a. Hence,

$$\int_0^{+\infty} a t e^{-at}\,dt = \frac{1}{a}$$

and therefore,

$$E(T^2) = \frac{2}{a^2}.$$

Thus,

$$\mathrm{Var}(T) = E(T^2) - E(T)^2 = \frac{2}{a^2} - \frac{1}{a^2} = \frac{1}{a^2}.$$


Fig. 8.4 Probability density for Example 13


Example 13 Find the variance of Y, a random variable with the density in Fig. 8.4.
The density of Y is f(y) = y for y in [0,1] and f(y) = 2 - y for y in [1,2]. The
mean of Y is 1 because the density is symmetric with respect to y = 1.
We confirm this by computation.

$$E(Y) = \int_0^2 y f(y)\,dy = \int_0^1 y^2\,dy + \int_1^2 y(2 - y)\,dy.$$

Thus,

$$E(Y) = \left[\frac{y^3}{3}\right]_0^1 + \left[y^2\right]_1^2 - \left[\frac{y^3}{3}\right]_1^2 = 1.$$

We now deal with $E(Y^2)$.

$$E(Y^2) = \int_0^2 y^2 f(y)\,dy = \int_0^1 y^3\,dy + \int_1^2 y^2(2 - y)\,dy.$$

So

$$E(Y^2) = \left[\frac{y^4}{4}\right]_0^1 + \left[\frac{2y^3}{3}\right]_1^2 - \left[\frac{y^4}{4}\right]_1^2 = \frac{7}{6}.$$

$$\mathrm{Var}(Y) = E(Y^2) - E(Y)^2 = 7/6 - 1 = 1/6.$$
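
The computation of Example 13 can be double-checked by numerical integration. This is only a sketch: the midpoint rule and the helper name `integrate` are our choices, not the method of the text.

```python
def integrate(h, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / steps
    return sum(h(a + (k + 0.5) * width) for k in range(steps)) * width

def density(y):
    # Triangular density of Example 13: y on [0, 1], 2 - y on [1, 2].
    if 0 <= y <= 1:
        return y
    if 1 < y <= 2:
        return 2 - y
    return 0.0

mean = integrate(lambda y: y * density(y), 0, 2)
second_moment = integrate(lambda y: y * y * density(y), 0, 2)
variance = second_moment - mean ** 2   # computational formula Var = E(Y^2) - E(Y)^2

print(mean, second_moment, variance)
```

The three printed values should be close to 1, 7/6 and 1/6, respectively.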


Problems 


18. Let X have density $f(x) = x^2 e^{-x}/2$. What is the variance of X?

19. Let U be a random variable that is uniformly distributed on [—1, 2]. What is 
the variance of U? 

20. Consider the random variables X and Y with densities $f(x) = \frac{3}{2}x^2$ for x in
[-1, 1] and $g(x) = \frac{3}{4}(1 - x^2)$ for x in [-1, 1], respectively.


(a) Sketch the graphs of f and g. Based on the graphs, which random variable 
should have the largest variance? 
(b) Compute the variances of X and Y. 


21. Let $f(x) = 3x^2$ for x in [0,1]. Let X be a random variable with density f.
What is the variance of X? 
7 Normal Random Variables 


We start with a particular normal random variable. We will then introduce the 
general case. 


7.1 The Standard Normal 


• A standard normal random variable is a continuous random variable Z with
density

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

and support (-∞, +∞) (Fig. 8.5).


• Let Z be a standard normal random variable. Then, E(Z) = 0 and Var(Z) = 1.


We will do these computations at the end of the chapter. 


Fig. 8.5 The density of a standard normal random variable


Throughout the book we will use the notation Z to designate a standard normal
random variable. In order to compute probabilities involving Z we need to integrate
its density. However, there is no closed formula for an antiderivative of
$\frac{1}{\sqrt{2\pi}} e^{-z^2/2}$.
Hence, we will need to rely on numerical integration. A table is provided in the
appendix for the probabilities P(0 < Z < z) for values of z in [0, 2.99]. In the
Binomial chapter we explained how to read the normal table.


7.2 Normal Random Variables 


We will use interchangeably the notations e* and exp(x). 


• A normal random variable with mean μ and standard deviation σ is a continuous
random variable X with density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

and support (-∞, +∞).


Note that for the definition above to make sense there are several things to be
checked. We need f to be a density for all μ and σ > 0. We also need to show that
E(X) = μ and that $\mathrm{Var}(X) = \sigma^2$. We will outline these proofs in the problems.


• A normal random variable is standard if μ = 0 and σ = 1.


Below are the densities of two normal random variables with μ = 1. One has
a standard deviation equal to 1 and the other one a standard deviation equal to
1/2. They both have the characteristic bell shaped form. However, one can see how
much more spread out the curve with σ = 1 is compared to the one with σ = 1/2
(Fig. 8.6).

The following is a very important property of normal distributions: 


• If X is a normal random variable with mean μ and standard deviation σ, then the
random variable

$$\frac{X - \mu}{\sigma}$$

is a standard normal random variable.


Fig. 8.6 The densities of two normal random variables with mean 1 with standard
deviations 1 and 1/2, respectively


What is remarkable is not that $(X - \mu)/\sigma$ has mean 0 and standard deviation 1. This
is true for any random variable that has a mean and a standard deviation. What is 
remarkable is that after shifting and scaling a normal random variable we still get a 
normal random variable. 

Thanks to the property above, any question about a normal variable can be 
changed into a question about a standard normal. 


Example 14 Assume that heights of 6-year-olds are normally distributed with mean
100 cm and standard deviation 2 cm. What is the probability that a 6-year-old taken
at random is at least 105 cm tall?

Let X be the height of the child picked at random. We want P(X > 105). We
standardize X to get

$$P(X > 105) = P\left(\frac{X - 100}{2} > \frac{105 - 100}{2}\right) = P(Z > 2.5) \approx 0.01.$$

So there is only about a 1% probability that a child taken at random is at least 105 cm tall.


Example 15 What is the height above which 90% of the 6-year-olds are?
We want h such that P(X > h) = 0.9. We standardize X again to get

$$P(X > h) = P\left(\frac{X - 100}{2} > \frac{h - 100}{2}\right) = P\left(Z > \frac{h - 100}{2}\right) = 0.9.$$

Note that $\frac{h - 100}{2}$ must be negative. By symmetry of the distribution of Z we have
that

$$P\left(Z > \frac{h - 100}{2}\right) = P\left(Z < \frac{-h + 100}{2}\right) = 0.9.$$

So according to the normal table, we have

$$\frac{-h + 100}{2} = 1.28.$$

We solve for h and get that h is approximately 97.44 cm.


Example 16 Let X be normally distributed with mean μ and standard deviation σ.
What is the probability that X is within 2σ of its mean?
We want P(|X - μ| < 2σ). We standardize X to get

$$P(|X - \mu| < 2\sigma) = P\left(\frac{|X - \mu|}{\sigma} < 2\right) = P(|Z| < 2) = 2P(0 < Z < 2) \approx 0.95.$$
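
Instead of reading the normal table, Examples 14 to 16 can be checked with a few lines of code. The sketch below assumes Python 3.8 or later and uses `NormalDist` from the standard `statistics` module; this is our choice of tool, the text itself relies on the table in the appendix.

```python
from statistics import NormalDist

Z = NormalDist()            # standard normal: mean 0, standard deviation 1
X = NormalDist(100, 2)      # heights of Examples 14 and 15 (mean 100, SD 2)

# Example 14: P(X > 105), computed directly and after standardizing.
p_direct = 1 - X.cdf(105)
p_standardized = 1 - Z.cdf((105 - 100) / 2)

# Example 15: the height h with P(X > h) = 0.9, that is P(X < h) = 0.1.
h = X.inv_cdf(0.1)

# Example 16: P(|Z| < 2).
p_within_2sd = Z.cdf(2) - Z.cdf(-2)

print(round(p_direct, 4), round(p_standardized, 4))
print(round(h, 2))
print(round(p_within_2sd, 4))
```

The first two numbers coincide, which is exactly the standardization property: a question about X becomes a question about Z.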


7.3 Applications of Normal Random Variables 


Normal random variables are appropriate to model a number of biological, physical,
and sociological phenomena. The probability that a normal observation is more than 2σ away
from its mean is about 0.05, and the probability that it is more than 3σ away from its mean
is about 0.003. Hence, a normal random variable is not appropriate to model phenomena
with long tails, that is, phenomena that have a significant probability of being far
away from the mean value of the distribution. Examples of random variables with
long tails are times between successive earthquakes, times between successive stock
market crashes or between successive pandemics.


7.4 Expectation and Variance of a Standard Normal 


We now show that E(Z) = 0 and Var(Z) = 1, where Z is a standard normal 
variable. 

Since the density of Z is an even function, it is symmetric with respect to 0. 
Hence, E(Z) = 0. 

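To compute $E(Z^2)$ we first note that

$$E(Z^2) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} x^2 e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}}\int_0^{+\infty} x^2 e^{-x^2/2}\,dx.$$

This is so because $x^2 e^{-x^2/2}$ is an even function. Next we do an integration by parts,

$$E(Z^2) = \frac{2}{\sqrt{2\pi}}\int_0^{+\infty} x^2 e^{-x^2/2}\,dx
= \frac{2}{\sqrt{2\pi}}\left(\left[-x e^{-x^2/2}\right]_0^{+\infty} + \int_0^{+\infty} e^{-x^2/2}\,dx\right),$$

where we wrote $x^2 e^{-x^2/2}$ as $x(x e^{-x^2/2})$ and we took the antiderivative of $x e^{-x^2/2}$
and the derivative of x. Since

$$\left[-x e^{-x^2/2}\right]_0^{+\infty} = 0$$

we get

$$E(Z^2) = \frac{2}{\sqrt{2\pi}}\int_0^{+\infty} e^{-x^2/2}\,dx.$$

Using now that $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is a probability density,

$$\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = 1.$$

Since f is an even function,

$$\int_0^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \frac{1}{2}.$$

Hence,

$$E(Z^2) = 2\int_0^{+\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = 1.$$

Therefore, $\mathrm{Var}(Z) = E(Z^2) - E(Z)^2 = 1$.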


Problems 


22. Assume that X is normally distributed with mean 3 and standard deviation 2.


(a) P(X > 3) = ?
(b) P(X > -1) = ?
(c) P(-1 < X < 3) = ?
(d) P(|X - 2| < 1) = ?


23. Assume that the diameter of a ball bearing is normally distributed with mean 1
cm and standard deviation 0.05 cm. A ball bearing is considered defective if its
diameter is larger than 1.1 cm or smaller than 0.9 cm.


(a) What is the proportion of defective ball bearings?
(b) Find the diameter above which 99% of the diameters are.


24. Assume that X is normally distributed with mean 5 and standard deviation σ.
Find σ so that P(X > 4) = 0.95.

25. Assume that the annual snow fall at some place is normally distributed with
mean 20 in. and standard deviation 8 in.


(a) What is the probability that the snow fall is less than 5 in. in a given year?
(b) What is the probability that the smallest annual snow fall in the next 20
years will be less than 5 in.?


26. Let Z be a standard normal random variable with density

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$$

In this exercise we will check that f is actually a density.


(a) Change the variables from Cartesian to polar to show that

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} e^{-(x^2+y^2)/2}\,dx\,dy = \int_0^{2\pi}\int_0^{+\infty} e^{-\rho^2/2}\,\rho\,d\rho\,d\theta.$$

(b) Show that the r.h.s. of (a) is 2π.
(c) Show that the l.h.s. of (a) is

$$\left(\int_{-\infty}^{+\infty} e^{-x^2/2}\,dx\right)^2.$$

(d) Conclude that f is a density.


27. Let X be a normal random variable with mean μ and standard deviation σ.


(a) Show that the density of X is symmetric with respect to μ. That is, show
that for all x, f(μ + x) = f(μ - x).
(b) Use (a) to show that E(X) is indeed μ.
(c) Compute Var(X).


28. Let

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}.$$

Show that f has inflection points at μ + σ and μ - σ.


More Problems for Chap. 8 


29. Assume that car battery lifetimes follow an exponential distribution with mean
3 years.


(a) What is the probability that a battery lasts 10 years or more?
(b) In a group of 10 batteries what is the probability that at least one will last
10 years or more?


30. Let X be a random variable with density $f(x) = ce^{-|x|}$.


(a) Find c.
(b) What is P(X > 1)?


31. Let X have density $g(x) = c(x - 1)^2$ for x in [0,2].


(a) Find c.
(b) Find E(X).
(c) Find Var(X).


32. It is believed that in the 1700s in Europe life expectancy at birth was only
around 40 years. That is, a newborn baby could expect on average to live 40
years. It is also known that child mortality was extremely high. Maybe as many
as 50% of all babies did not make it to their fifth birthday.


(a) Compare the median life span to the expected life span.
(b) Were people old at 35?


33. My new tires are expected to last around 40,000 miles with a standard deviation
of 5000 miles. Should I pick a uniform, a normal, or an exponential random
variable to model this mileage? Explain your choice and specify the parameters
of the distribution you pick.

34. The National Weather Service issues a thunderstorm warning between 4:55 and
5:20. What distribution should I use to model the time of the storm?

35. Damaging hail storms happen in my area on average every 7 years with a
standard deviation of 3 years. What distribution should I use to model the time
of the next damaging hail?


Chapter 9
The Sample Average and Variance


1 The Sample Average 


Let n ≥ 1 be a natural number. Let $X_1, X_2, \dots, X_n$ be independent identically
distributed (i.i.d. in short) random variables. That is, these n random variables have
the same distribution and are independent.

Consider for instance a sample of 100 two-year-old girls selected at random. We
use this sample to study the height of a girl at two years: $X_1, X_2, \dots, X_{100}$ are the
100 heights. The sample size is n = 100. A reasonable model for the height at two
years is a normal distribution with mean μ and variance $\sigma^2$. To find μ and $\sigma^2$ we
would need to measure every two-year-old girl in the population. This is not going
to happen. The parameters μ and $\sigma^2$ are going to remain unknown. We will estimate
these parameters by using the observations $X_1, X_2, \dots, X_n$.


• A statistic is any function that can be computed using $(X_1, \dots, X_n)$ only.

A very important statistic is the sample average defined below.


• The sample average is defined as

$$\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n).$$

The sample average will be used to estimate the expected value μ of the
distribution. Going back to our example of the height of a two-year-old girl, we
will never know the true value of μ = E(X), but we can estimate μ by using the
sample average X̄. Note that μ is an (unknown) constant, but X̄ is random. For every
new sample we get a new X̄.


• Let $(X_1, \dots, X_n)$ be i.i.d. with expected value μ. Then, E(X̄) = μ.


Because E(X̄) = μ, X̄ is said to be an unbiased statistic for μ.


© Springer Nature Switzerland AG 2022
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_9 




We now compute E(X̄). Recall first that the expectation is a linear operator. That
is, for any two random variables X and Y,

E(X + Y) = E(X) + E(Y),

and for any real number a,

E(aX) = aE(X).

By the linearity of the expected value, we get

$$\begin{aligned}
E(\bar{X}) &= E\left(\frac{1}{n}(X_1 + X_2 + \cdots + X_n)\right) \\
&= \frac{1}{n} E(X_1 + X_2 + \cdots + X_n) \\
&= \frac{1}{n}\left(E(X_1) + E(X_2) + \cdots + E(X_n)\right) \\
&= \frac{1}{n}\, n\mu = \mu.
\end{aligned}$$

We now know that X̄ is an unbiased statistic for μ. The next step is to find out
how dispersed X̄ is around μ. This is why we compute the variance of X̄.


• Let $(X_1, \dots, X_n)$ be i.i.d. with expected value μ and variance $\sigma^2$. Then,

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$

We will compute Var(X̄) below. Observe first that by the definition of the
variance

$$\mathrm{Var}(\bar{X}) = E\left((\bar{X} - \mu)^2\right) = \frac{\sigma^2}{n}.$$

In particular, as n goes to infinity, the dispersion of X̄ around μ goes to 0! Hence,
X̄ converges to μ. This confirms that X̄ is a good statistic to estimate μ.

• Law of Large Numbers As the sample size n goes to infinity, X̄ converges to μ.


The convergence above is in the sense that Var(X̄) converges to 0. There are
several other types of convergence for the convergence of X̄ to μ. These are seen in
more advanced probability texts.




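We now compute Var(X̄). First recall that if X and Y are two independent
random variables, then

Var(X + Y) = Var(X) + Var(Y).

If a is a real number, then

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X).$$

Using these two properties and the fact that $X_1, X_2, \dots, X_n$ are independent,

$$\begin{aligned}
\mathrm{Var}(\bar{X}) &= \mathrm{Var}\left(\frac{1}{n}(X_1 + X_2 + \cdots + X_n)\right) \\
&= \frac{1}{n^2}\,\mathrm{Var}(X_1 + X_2 + \cdots + X_n) \\
&= \frac{1}{n^2}\left(\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n)\right) \\
&= \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.
\end{aligned}$$

This completes the computation of the variance of X̄.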
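
The two facts about the sample average, that its expected value is μ and its variance is $\sigma^2/n$, can be illustrated by simulation. The following is a sketch under assumptions of our own: uniform random variables on (0, 1) (for which μ = 1/2 and $\sigma^2 = 1/12$), sample size n = 25, and 20,000 repetitions.

```python
import random
from statistics import fmean, pvariance

random.seed(0)  # fixed seed so the run is reproducible

def sample_average(n):
    """Average of n i.i.d. uniform random variables on (0, 1)."""
    return fmean(random.random() for _ in range(n))

n = 25
averages = [sample_average(n) for _ in range(20_000)]

sigma2 = 1 / 12                          # variance of a uniform on (0, 1)
print(fmean(averages))                   # should be close to mu = 0.5
print(pvariance(averages), sigma2 / n)   # empirical variance vs sigma^2 / n
```

Increasing n in this sketch makes the empirical variance of the averages shrink like 1/n, which is the content of the Law of Large Numbers as stated above.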


2 The Central Limit Theorem 


Consider $(X_1, \dots, X_n)$ an i.i.d. sequence of random variables with expected value μ
and variance $\sigma^2$. In the last section we saw that as the sample size n goes to infinity,
the sample average X̄ converges to μ. The following fundamental result gives the
typical deviations of X̄ with respect to μ.


• Central Limit Theorem Let $(X_1, \dots, X_n)$ be i.i.d. with expected value μ and
variance $\sigma^2$. Then, the distribution of

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

converges to a standard normal distribution as n goes to infinity.


Note that the standard deviation SD(X̄) of X̄ is $\sigma/\sqrt{n}$ and E(X̄) = μ. Hence,

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - E(\bar{X})}{SD(\bar{X})}.$$

That is, the Central Limit Theorem (CLT in short) states that the scaled and centered
X̄ converges to a standard normal distribution. What is remarkable is that the
limiting distribution is always normal. It does not matter what the initial distribution
is (provided this distribution has a mean and a variance).

We illustrate the CLT for uniform random variables in Figs. 9.1 and 9.2. 


• The CLT tells us that the approximate distribution of the average of n i.i.d.
random variables is normal when n is large. However, it does NOT say that every
distribution is normal! If the sample size n is not large enough, we cannot use the
CLT.


Example 1 Assume that we have 25 batteries whose lifetimes are exponentially
distributed with mean 2 h. If the batteries are used one at a time, with a failed
battery replaced immediately by a new one, what is the probability that after 60 h
there is still a working battery?

Let $X_i$ be the lifetime of the ith battery for i = 1, ..., 25. We want to compute

$$P(X_1 + \cdots + X_{25} > 60).$$




Fig. 9.1 This histogram was obtained by simulating 10° uniform random variables. As expected 
we get a flat histogram 




Fig. 9.2. This histogram was obtained by simulating sample averages of 10° uniform random 
variables. We ran 10° such averages. As predicted by the CLT we get a bell shaped histogram 


The distribution of a sum of exponentially distributed random variables is not
exponentially distributed. To solve this question it is easier to use the CLT rather
than use the exact distribution of the sum. The CLT applies since we have an i.i.d.
sequence of random variables. Recall that for an exponential random variable the
mean and the standard deviation are equal. In this case, μ = σ = 2. According to
the CLT the distribution of

$$\frac{\bar{X} - 2}{2/\sqrt{25}}$$

is approximately that of a standard normal Z. Hence,

$$\begin{aligned}
P(X_1 + \cdots + X_{25} > 60) &= P\left(\bar{X} > \frac{60}{25}\right) \\
&= P\left(\frac{\bar{X} - 2}{2/\sqrt{25}} > \frac{2.4 - 2}{2/\sqrt{25}}\right) \\
&\approx P(Z > 1) \approx 0.16.
\end{aligned}$$

So there is a probability of around 16% that the batteries will last 60 h or more.
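
How good is the CLT approximation here? The sketch below compares it with the exact value. The exact computation relies on a standard fact that is not derived in this text: when lifetimes are i.i.d. exponential, the number of failures by time t has a Poisson distribution with mean (rate) × t, and the 25 batteries last beyond t exactly when fewer than 25 failures have occurred by time t. The variable names are ours.

```python
import math
from statistics import NormalDist

n, mean_life, t = 25, 2.0, 60.0
rate = 1 / mean_life

# CLT approximation, as in the example: standardize the sample average.
z = (t / n - mean_life) / (mean_life / math.sqrt(n))
clt_estimate = 1 - NormalDist().cdf(z)

# Exact value (assumption stated above): P(fewer than n failures by time t),
# where the number of failures is Poisson with mean rate * t.
lam = rate * t
exact = sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n))

print(round(z, 6), round(clt_estimate, 4), round(exact, 4))
```

The two probabilities agree to about two decimal places, which suggests that n = 25 is already large enough for the CLT to be useful in this example.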




An important application of the CLT is the following:

• Let $(X_1, \dots, X_n)$ be i.i.d. Bernoulli random variables with parameter p. Denote
X̄ by p̂. Then, as n goes to infinity, the distribution of

$$\frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

converges to a standard normal distribution.


Recall that a Bernoulli distribution has mean p and variance p(1 — p). Hence, 
the statement above is just the CLT in the particular case of a Bernoulli distribution. 


3 The Sample Variance 


In many situations we have to estimate the standard deviation o of the underlying 
distribution. 


• Let $(X_1, \dots, X_n)$ be i.i.d. with expected value μ and variance $\sigma^2$. The sample
variance is defined by

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$

The statistic $S^2$ is an unbiased estimator of $\sigma^2$. That is, $E(S^2) = \sigma^2$.


We first establish a computational formula for $S^2$. We expand the square below
to get

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - 2\sum_{i=1}^{n} X_i \bar{X} + \sum_{i=1}^{n} (\bar{X})^2.$$

We use the cumbersome notation $(\bar{X})^2$ to indicate that we take the square of the
average, NOT the average of the squares. Using that $\sum_{i=1}^{n} X_i = n\bar{X}$,

$$\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - 2n(\bar{X})^2 + n(\bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n(\bar{X})^2.$$

Thus, we get the following computational formula for $S^2$,

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} X_i^2 - \frac{n}{n-1}(\bar{X})^2.$$


We use the preceding formula to compute the expected value of $S^2$. First, recall that
for any random variable Y (that has a variance)

$$\mathrm{Var}(Y) = E(Y^2) - E(Y)^2.$$

Hence,

$$E(Y^2) = \mathrm{Var}(Y) + E(Y)^2. \qquad (1)$$

Going back to

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} X_i^2 - \frac{n}{n-1}(\bar{X})^2$$

and taking expectations on both sides,

$$E(S^2) = \frac{1}{n-1}\sum_{i=1}^{n} E(X_i^2) - \frac{n}{n-1}E((\bar{X})^2).$$

Using formula (1) for $Y = X_i$,

$$E(X_i^2) = \mathrm{Var}(X_i) + E(X_i)^2 = \sigma^2 + \mu^2.$$

Using formula (1) for $Y = \bar{X}$,

$$E((\bar{X})^2) = \mathrm{Var}(\bar{X}) + E(\bar{X})^2 = \frac{\sigma^2}{n} + \mu^2.$$

Hence,

$$E(S^2) = \frac{1}{n-1}\, n(\sigma^2 + \mu^2) - \frac{n}{n-1}\left(\frac{\sigma^2}{n} + \mu^2\right) = \sigma^2.$$

This shows that $S^2$ is an unbiased estimator of $\sigma^2$.
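
The computational formula for $S^2$ translates directly into code. In the sketch below the function name `sample_variance` and the four observations are hypothetical, chosen only for illustration; the result is compared with `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
from statistics import fmean, variance

def sample_variance(xs):
    """S^2 via the computational formula: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    x_bar = fmean(xs)
    return (sum(x * x for x in xs) - n * x_bar ** 2) / (n - 1)

data = [2.0, 4.0, 4.0, 5.0]    # hypothetical observations, not from the text
print(sample_variance(data))   # computational formula
print(variance(data))          # library value, same n - 1 divisor
```

Both lines should print the same number up to rounding error. Note that the computational formula subtracts two quantities that can be large, so for data with a very large mean the definition of $S^2$ is numerically safer.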


Problems 


1. Consider the following 10 i.i.d. observations of an unknown distribution. 
0.382, 0.101, 0.596, 0.885, 0.899, 0.958, 0.014, 0.407, 0.863, 0.139. 


(a) Compute the sample average X. 
(b) Compute the sample standard deviation S. 




2. A small airplane can take off with a maximum of 2000kg. Assume that 
passengers have a mean weight of 70 kg with a SD of 15 kg. 


(a) What is the probability that 25 passengers will overload the plane? 
(b) Find the maximum number of passengers that will not overload the plane 
with probability 0.99. 


3. Assume that first graders have a mean height of 100 cm with an SD of 8 cm. 


(a) What is the probability that the average height in a class of 30 is over 
105 cm? 

(b) What is the probability that at least one child is taller than 105 cm?

(c) What assumption did you make to answer (b)? 


4. A bank teller takes a mean of 2 min with a standard deviation of 30s to serve a 
client. Assuming that there is at least one client waiting at all times, what is the 
probability that the teller will serve 25 clients or more in one hour? 

5. The mean grade a professor hands out is 80 with a standard deviation of 10. 
What is the probability that in a class of 50 the average grade is below 75? 

6. Recall that if B has a binomial distribution with parameters n and p, then

$$\frac{B - np}{\sqrt{np(1 - p)}}$$

can be approximated by a standard normal distribution. Explain why this is a
special case of the CLT.
7. Let $X_1, \dots, X_n$ be an i.i.d. sequence with variance $\sigma^2$. Consider the statistic

$$T = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$

(a) Show that

$$E(T) = \frac{n-1}{n}\,\sigma^2.$$

(b) Is T an unbiased estimator of $\sigma^2$?


4 Monte Carlo Integration 


We apply the Law of Large Numbers to introduce a numerical integration method
called Monte Carlo integration. Let g be a continuous function on [0,1]. Let
$U_1, U_2, \dots, U_n$ be a sequence of i.i.d. uniform random variables on [0,1]. Then,
$g(U_1), g(U_2), \dots, g(U_n)$ is also a sequence of i.i.d. random variables. By the Law
of Large Numbers,

$$\lim_{n\to\infty} \frac{g(U_1) + \cdots + g(U_n)}{n} = E(g(U_1)).$$

Recall that if $U_1$ has density f with support (a, b), then

$$E(g(U_1)) = \int_a^b g(x) f(x)\,dx.$$

Since we are dealing with uniform random variables, f(x) = 1 for x in [0,1] and
f(x) = 0 otherwise. Thus,

$$E(g(U_1)) = \int_0^1 g(x)\,dx.$$

Hence,

$$\lim_{n\to\infty} \frac{g(U_1) + \cdots + g(U_n)}{n} = \int_0^1 g(x)\,dx.$$

In words, the average

$$\frac{g(U_1) + \cdots + g(U_n)}{n}$$

approaches the integral $\int_0^1 g(x)\,dx$ as n goes to infinity.
For instance, let g(x) = x and use 10 random numbers, 0.382, 0.101, 0.596,
0.885, 0.899, 0.958, 0.014, 0.407, 0.863, 0.139. Then,

$$\frac{g(U_1) + \cdots + g(U_n)}{n} = \frac{U_1 + \cdots + U_n}{n} \approx 0.52.$$

Thus, 0.52 is the approximation we get for the integral

$$\int_0^1 g(x)\,dx = \int_0^1 x\,dx = 0.5.$$
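
The method is easy to program. Here is a minimal sketch; the function name `monte_carlo_integral`, the seed, and the choice of 100,000 pseudo-random numbers are our assumptions. It first reproduces the average of the ten numbers used above, then repeats the estimate with a much larger sample.

```python
import random

def monte_carlo_integral(g, uniforms):
    """Average of g over the given points, an estimate of the integral of g on [0, 1]."""
    values = [g(u) for u in uniforms]
    return sum(values) / len(values)

# The ten random numbers used in the text, with g(x) = x.
text_numbers = [0.382, 0.101, 0.596, 0.885, 0.899,
                0.958, 0.014, 0.407, 0.863, 0.139]
small_estimate = monte_carlo_integral(lambda x: x, text_numbers)

# A larger run with pseudo-random uniforms; the seed is fixed for reproducibility.
random.seed(1)
uniforms = [random.random() for _ in range(100_000)]
large_estimate = monte_carlo_integral(lambda x: x, uniforms)

print(round(small_estimate, 4), round(large_estimate, 4))
```

With only ten numbers the estimate is off in the second decimal place. With 100,000 numbers the estimate is typically much closer to 0.5, as the Law of Large Numbers predicts.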


Problem 


1. (a) Use Monte Carlo integration to estimate 


(b) Use the normal table to check the accuracy of the estimate in (a). 


Chapter 10
Estimating and Testing Proportions


1 Testing a Proportion


We will illustrate the hypothesis testing method on an example. 


Example 1 A manufacturer claims that he produces strictly less than 5% defective
items. A sample of 100 items is taken at random and 4 are found to be defective. 
Test the claim of the manufacturer. 

We denote the true proportion of defective items by p. The claim of the 
manufacturer is that p < 0.05. We want to test whether this claim holds based 
on the observations. That is, if the proportion of defective items in the sample is low 
enough then there will be enough evidence for the manufacturer’s claim. 

The manufacturer’s claim is called the alternative hypothesis and it is denoted
by $H_a$. The negation of this claim is called the null hypothesis and is denoted by
$H_0$. So the test we would like to perform is

$H_0$: p ≥ 0.05
$H_a$: p < 0.05

It is convenient to have an equality for the null hypothesis. It turns out that it
is always possible to replace an inequality by an equality in the null hypothesis
without changing the test. The reason is a little involved so we will omit it but we
will actually test

$H_0$: p = 0.05
$H_a$: p < 0.05

There are only two possible decisions. We either reject Ho (i.e. the evidence 
supports the manufacturer’s claim) or we do not reject Ho (i.e. there is not enough 


© Springer Nature Switzerland AG 2022
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_10 




evidence to support the manufacturer’s claim). In order to make a decision we
compute the following probability. Given that the true defective proportion is 5%
what is the probability to see 4% or less in a sample of size 100? This probability is
called the P-value and is

$$P(\hat{p} \le 0.04 \mid p = 0.05).$$

In order to compute this P-value we use the Central Limit Theorem. It states that

$$\frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

converges to a standard normal distribution as n → ∞. In this chapter n will always
be large enough to use the CLT. Hence, with n = 100,

$$\begin{aligned}
P &= P(\hat{p} \le 0.04 \mid p = 0.05) \\
&= P\left(\frac{\hat{p} - 0.05}{\sqrt{0.05(0.95)/n}} \le \frac{0.04 - 0.05}{\sqrt{0.05(0.95)/n}}\right) \\
&\approx P(Z < -0.46) \approx 0.32,
\end{aligned}$$

where we used the normal table to get P(Z < -0.46). The P-value tells us that if
the true proportion of defective items is p = 0.05 there is a 32% chance of observing
4% or less defective items in the sample. This is a high probability. Therefore, we
do not reject $H_0$. We conclude that there is not enough evidence to support the claim
of the manufacturer.

We now summarize the method for testing a proportion. 

Assume we have a sample of n independent observations with the same
probability p of having a certain property. Let the proportion of the sample that
has the property be p̂. Let $p_0$ be a fixed number in (0, 1) and

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}.$$

• For the test

$H_0$: $p = p_0$
$H_a$: $p > p_0$

the P value is P = P(Z > z).

• For the test

$H_0$: $p = p_0$
$H_a$: $p < p_0$

the P value is P = P(Z < z).

• For the test

$H_0$: $p = p_0$
$H_a$: $p \neq p_0$

the P value is P = 2P(Z < z) if z < 0. The P value is P = 2P(Z > z) if z > 0.


• The P-value of a test is the probability that given $H_0$: $p = p_0$ the sample
proportion is as far or further away from $p_0$ than the observed p̂ in the present
sample.

• In general, when P < 0.01 we reject $H_0$ and when P > 0.1 we do not reject $H_0$.

• When 0.01 < P < 0.1 the decision becomes more arbitrary. We give ourselves
a significance level a (usually 5%). If P < a we reject $H_0$. If P > a we do not
reject $H_0$.
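
The three tests summarized above fit in one short function. This is a sketch, not a library routine: the name `proportion_test` and the labels "less", "greater" and "two-sided" are our own, and the normal probabilities come from `NormalDist` in the Python standard library rather than from the table. We apply it to Example 1 (4 defective items out of 100, alternative p < 0.05).

```python
import math
from statistics import NormalDist

def proportion_test(p_hat, p0, n, alternative):
    """Return (z, P-value) for H0: p = p0 against 'less', 'greater' or 'two-sided'."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf
    if alternative == "less":
        p_value = cdf(z)                          # P(Z < z)
    elif alternative == "greater":
        p_value = 1 - cdf(z)                      # P(Z > z)
    elif alternative == "two-sided":
        p_value = 2 * min(cdf(z), 1 - cdf(z))     # twice the smaller tail
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p_value

# Example 1 of this chapter: 4 defective items out of 100, claim p < 0.05.
z, p_value = proportion_test(0.04, 0.05, 100, "less")
print(round(z, 2), round(p_value, 2))
```

For the two-sided case, taking twice the smaller tail gives 2P(Z < z) when z < 0 and 2P(Z > z) when z > 0, as in the summary.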


See Figs. 10.1, 10.2, and 10.3 for the three different tests. 


Example 2 Candidate A claims that more than 10% of the voters are in his favor. In 
a poll candidate A got 121 votes in a random sample of 1,000. Test the claim of A. 


Fig. 10.1 The shaded area in the figure above is the P-value for a test where $H_a$: $p < p_0$


Fig. 10.2 The shaded area in the figure above is the P-value for a test where $H_a$: $p > p_0$


Fig. 10.3 The sum of the shaded areas in the figure above is the P-value for a test
where $H_a$: $p \neq p_0$


Let p be the proportion of voters in favor of candidate A. The alternative
hypothesis should be p > 0.1 since this is the claim we want to test. Therefore,
the test is

$H_0$: p = 0.1
$H_a$: p > 0.1

Since the alternative hypothesis is p > 0.1 we will reject the null hypothesis if
the sample proportion p̂ is large. In this example, $p_0 = 0.1$, p̂ = 121/1000 and
n = 1000. Hence,

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.121 - 0.1}{\sqrt{0.1(0.9)/1000}} \approx 2.21.$$

The P-value is

$$P = P(Z > z) = P(Z > 2.21) \approx 0.01.$$

Since P < 0.05, at the level 5% we reject $H_0$. That is, at the 5% level, there is
statistical evidence supporting the claim of candidate A.


Example 3 A treatment for a certain condition is known to cure 80% of the patients. 
A new treatment is proposed and we would like to compare it to the existing 
treatment. In a sample of 100 patients 92 are cured by the new treatment. Is there a 
significant difference between cure rates between the two treatments? 

In this example we are not given a specific claim to test. We are told to compare
the two treatments. In a case like this we do a two-tailed test. That is,

$H_0$: p = 0.8
$H_a$: p ≠ 0.8

In this example $p_0 = 0.8$, p̂ = 92/100 and n = 100. Thus,

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.92 - 0.8}{\sqrt{0.8(1 - 0.8)/100}} = 3.$$

The P-value is 2P(Z > 3) which is approximately 0.003. The P-value is quite
small. We reject the null hypothesis at the 0.01 level. There is strong evidence that
the cure rates are different.


2 Confidence Interval for a Proportion 


In Example 2 we rejected the null hypothesis. That is, there is evidence that p > 0.1. 
Can we be more precise? Where is p? This is what we answer next. 

For a sample large enough we have the following formula for a confidence 
interval for a proportion. 


• Assume we have a sample of n independent observations with the same
probability p of having a certain property. Let the proportion of the sample that
has the property be p̂. Let a be in (0, 1) and Z be a standard normal. Let $z_a$ be
such that

$$P(|Z| < z_a) = a.$$

Let

$$c = z_a\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},$$

then (p̂ - c, p̂ + c) is a confidence interval for p with confidence a.


We will also say that the confidence interval is at the level a. The formula above
is a consequence of the Central Limit Theorem. Note that to compute the variance
in a hypothesis test we use $p_0(1 - p_0)/n$. This is so because the P value is computed
assuming that $p = p_0$. In contrast when we estimate the variance in a confidence
interval p is unknown. This is why we use p̂(1 - p̂)/n.


Example 4 We revisit Example 2. In a poll candidate A got 121 votes in a random 
sample of 1000. Can we find an interval for the true proportion p of voters in favor 
of A? 

First we give ourselves a confidence level, usually 90%, 95% or 99%. In this
example we pick a confidence of 95%. Our objective is to find c so that p is in the
interval (p̂ - c, p̂ + c) with probability 95%.

Note that

$$P(|Z| < z_a) = 2P(0 < Z < z_a).$$

Since we want $P(|Z| < z_a) = 0.95$ then $P(0 < Z < z_a) = 0.475$. We look for the
probability in the normal table which is closest to 0.475. We find $z_a = 1.96$. In this
example we have p̂ = 121/1000 and n = 1000. Hence,

$$c = z_a\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \approx 0.02.$$

Therefore, we may say that with confidence 0.95 the population proportion is in the
interval

$$(\hat{p} - c, \hat{p} + c) = (0.12 - 0.02, 0.12 + 0.02) = (0.10, 0.14).$$

One interpretation for the confidence interval above is the following. If we take
many samples of 1000 voters then about 95% of the confidence intervals will contain p. Of
course, we may be unlucky and draw a sample that yields an interval not containing
p. This will happen about 5% of the time.


Example 5 Find a 99% confidence interval for the proportion in Example 2.
The only difference with Example 4 is the level of confidence. According to the
normal table

P(|Z| < 2.57) = 0.99.

Hence,

$$c = 2.57\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$

Numerically, we get c = 0.0265. Therefore, we may say that with confidence 0.99
the population proportion is in the interval

$$(\hat{p} - c, \hat{p} + c) = (0.121 - 0.0265, 0.121 + 0.0265) = (0.09, 0.15).$$

Note that at the level 0.99 we get a larger confidence interval. We increased the
confidence (from 95% to 99%) but we decreased the precision (we got a larger
interval). The only way to increase the confidence without decreasing the precision
is to increase the sample size.


Example 6 How large should a random sample be to get an estimate of the 
population proportion within 0.01 with confidence 0.95? 

We want to know how large n should be in order to get c = 0.01. We need to 
solve in n the following equation 


$$c = z_a \sqrt{\frac{p(1 - p)}{n}}.$$

For $a = 0.95$ we get $z_a = 1.96$. A little algebra yields

$$n = \left(\frac{z_a}{c}\right)^2 p(1 - p).$$




However, we do not know $p$ except that $p$ is in (0, 1). Note that the function
$g(p) = p(1 - p)$ has a maximum for $p = 1/2$. Thus,

$$p(1 - p) \le 1/4 \text{ for all } p \text{ in } (0, 1).$$

Hence,

$$n \le \left(\frac{z_a}{c}\right)^2 \frac{1}{4}.$$

Numerically, $n \le 9604$. That is, in order to get a precision of 0.01 with confidence
0.95 we need a sample of the order of 10,000. This estimate is based on a worst 
case scenario. We did the computation assuming p = 0.5. If we assume p = 0.1 
for instance then n = 3457, about three times smaller! In practice we use p = 0.5 
when we have no idea what to expect for p. 
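
The sample size formula of Example 6 is easy to evaluate for other precisions and confidence levels. The sketch below is illustrative: the function name `sample_size` and its default `p=0.5` (the worst case discussed above) are assumptions, and $z_a$ is computed with `statistics.NormalDist` instead of being read from a table.

```python
from math import ceil
from statistics import NormalDist


def sample_size(margin, confidence, p=0.5):
    """Smallest integer n with z_a * sqrt(p * (1 - p) / n) <= margin."""
    z_a = NormalDist().inv_cdf((1 + confidence) / 2)
    # Solve c = z_a * sqrt(p (1 - p) / n) for n and round up to an integer.
    return ceil((z_a / margin) ** 2 * p * (1 - p))


print(sample_size(0.01, 0.95))         # worst case p = 1/2
print(sample_size(0.01, 0.95, p=0.1))  # if p is believed to be near 0.1
```

Because the result is rounded up and $z_a$ is not rounded to 1.96, the second value can differ by one unit from a hand computation.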


Problems 


1. Candidate B claims that she has more than 50% support. A random sample of 
25 individuals is taken. A 1 means that the individual supports B, a 0 means the 
individual does not support B. Here are the results. 

OODOOIITIIOLLOLOOLILIOLOIOOIIL 


(a) Compute the sample proportion p in favor of B. 
(b) Test the claim of B at the 5% level. 
(c) Find a 90% confidence interval for p. 


2. The manufacturer claims that less than 5% of the items it manufactured are 
defective. Assume that in a random sample of 1000 items 40 are defective. 


(a) Test the claim of the manufacturer at the level 10%. 
(b) Find a 95% confidence interval for the proportion of defective items. 


3. A pharmaceutical company claims that its new drug is more efficient than the 
existing one that cures about 70% of the cases treated. In a random sample of 
96 patients 81 were cured by the new drug. 


(a) Test the claim of the pharmaceutical company.
(b) Find a 99% confidence interval for the cure rate with the new drug. 


4. We want to find out whether a coin is fair. The coin is tossed 24,000 times. We 
get 12,012 heads. 


(a) Test whether this is a fair coin. 
(b) Find a confidence interval at the level 0.99 for the probability of heads.


5. A roulette has 38 slots and 18 red slots.

(a) What is the probability of a red slot for a well balanced roulette wheel?
(b) Test whether the roulette is fair. Here are the results of 30 spins of
the roulette. A 1 is red, a 0 is black or green.
1OOLIOLIOOLILLILIIOLLIOLOLOOLIOILO
(c) Find a 90% confidence interval for the probability of red.


6. A poll institute claims that its estimate of a proportion is within 0.02 of the true
value with confidence 0.95. How large must the sample be?


7. A poll institute has interviewed 1000 people and gives an estimate of a
proportion within 0.01. What is the confidence of this estimate?


8. A researcher claims that more than 10% of Ponderosa pines attacked by a
certain type of beetle die. We select at random 250 infested Ponderosa pines.
Of those 34 died.


(a) Test the claim of the researcher.
(b) Find a confidence interval at the level 0.9 for the proportion of trees that
die when attacked by this type of beetle.


9. We want to find out whether a coin is fair. The coin is tossed 12 times and 11
tails are observed. Test whether the coin is fair. (The sample is too small to use
the CLT but you may use the binomial distribution).

10. For a random sample of size $n$ the confidence interval for $p$ is $(\hat{p} - c, \hat{p} + c)$
where

$$c = z_a \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},$$

$a$ is the confidence and $z_a$ is such that $P(|Z| < z_a) = a$.


(a) Explain why we would like $c$ small and $a$ large.
(b) If $n$ increases show that $c$ decreases.
(c) If $a$ increases what happens to $z_a$? To $c$?


3 Testing Two Proportions


Example 7 Test whether candidate A has more support in Colorado Springs than 
in Boulder. In a poll of 1000 voters candidate A got 42% of the votes in Colorado 
Springs. In Boulder A got 39% of the votes in a poll of 500 voters. 


Let $p_1$ and $p_2$ be respectively the true proportions of voters in favor of A in
Colorado Springs and in Boulder. We want to test whether $p_1 > p_2$. So we want to
perform the test

$$H_0: p_1 = p_2$$
$$H_a: p_1 > p_2$$

Note that this test can actually be expressed as a one parameter test with parameter
$p_1 - p_2$ by writing it as

$$H_0: p_1 - p_2 = 0$$
$$H_a: p_1 - p_2 > 0$$

The parameter $p_1 - p_2$ is estimated by $\hat{p}_1 - \hat{p}_2$. In this example $\hat{p}_1 = 0.42$,
$\hat{p}_2 = 0.39$ and therefore $\hat{p}_1 - \hat{p}_2 = 0.42 - 0.39 = 0.03$.
The P-value for this test is

$$P = P(\hat{p}_1 - \hat{p}_2 > 0.03 \mid p_1 - p_2 = 0).$$

By the linearity of the expectation,

$$E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2.$$

Since the two samples are independent,

$$\mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2) = \frac{1}{n_1} p_1(1 - p_1) + \frac{1}{n_2} p_2(1 - p_2).$$

If $n_1$ and $n_2$ are large the CLT applies. That is,

$$\frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2}}$$

converges in distribution to a standard normal. Thus,

$$P = P\left(Z > \frac{0.03}{\sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2}}\right).$$

We still need to estimate $p_1$ and $p_2$ to compute $P$. The P-value is always computed
assuming $H_0$. In this case, $H_0: p_1 = p_2$. This is why we use the pooled estimate

$$\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

for both $p_1$ and $p_2$. Thus,

$$P = P\left(Z > \frac{0.03}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}\right).$$

Numerically, $\hat{p} = 0.41$ and $P = P(Z > 1.11) = 0.13$. At the 5% (or 10%) level we
do not reject $H_0$. There is not enough evidence to claim that the support of candidate
A is larger in Colorado Springs than in Boulder.

We now summarize the method for testing two proportions. 

Assume we have two independent random samples of size $n_1$ and $n_2$ from two
distinct populations. Let $p_1$ and $p_2$ be the true proportions of a certain property in
populations 1 and 2. Let $\hat{p}_1$ and $\hat{p}_2$ be the corresponding sample proportions. Let

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}$$

where

$$\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}.$$

• For the test
$$H_0: p_1 = p_2$$
$$H_a: p_1 < p_2$$
the P value is $P(Z < z)$.
• For the test
$$H_0: p_1 = p_2$$
$$H_a: p_1 > p_2$$
the P value is $P(Z > z)$.
• For the test
$$H_0: p_1 = p_2$$
$$H_a: p_1 \ne p_2$$
the P value is $2P(Z < z)$ if $z < 0$. The P value is $2P(Z > z)$ if $z > 0$.
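
The summary above translates directly into a few lines of code. The following Python sketch is illustrative only: the function name `two_proportion_z` is an assumption, and the normal probabilities come from `statistics.NormalDist` rather than from a table. It is applied to the data of Example 7.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1_hat, n1, p2_hat, n2):
    """z statistic for H0: p1 = p2, using the pooled estimate of p."""
    pooled = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)
    return (p1_hat - p2_hat) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))


# Example 7: 42% of 1000 voters in Colorado Springs, 39% of 500 in Boulder.
z = two_proportion_z(0.42, 1000, 0.39, 500)
p_value = 1 - NormalDist().cdf(z)  # alternative H_a: p1 > p2, so P(Z > z)
print(round(z, 2), round(p_value, 2))
```

For the alternative $p_1 < p_2$ one would use `NormalDist().cdf(z)` instead, and for the two-sided alternative twice the smaller of the two tail probabilities.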


4 Confidence Interval for Two Proportions 


We have the following formula for a confidence interval for the difference of two 
proportions. 




• Assume we have two independent random samples of size $n_1$ and $n_2$ from two
distinct populations. Let $p_1$ and $p_2$ be the true proportions of a certain property
in populations 1 and 2. Let $\hat{p}_1$ and $\hat{p}_2$ be the corresponding sample proportions.
Let $a$ be in (0, 1) and $Z$ be a standard normal. Let $z_a$ be such that

$$P(|Z| < z_a) = a.$$

Let

$$c = z_a \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}},$$

then $(\hat{p}_1 - \hat{p}_2 - c, \hat{p}_1 - \hat{p}_2 + c)$ is a confidence interval for $p_1 - p_2$ with
confidence $a$.


Example 8 In a political poll of 100 randomly selected voters, 35 expressed support
for initiative A in Boulder. In Colorado Springs in a poll of 200 randomly selected
voters, 50 expressed support for initiative A. Find a confidence interval, with
confidence 0.9, for the difference between the proportions of supporters of initiative
A in Boulder and in Colorado Springs.

In this example, $n_1 = 100$, $n_2 = 200$, $\hat{p}_1 = 35/100$ and $\hat{p}_2 = 50/200$. Since

$$P(|Z| < z_a) = 2P(0 < Z < z_a),$$

$z_a = 1.64$ for $a = 0.9$. Hence,

$$c = 1.64 \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.09.$$

At the level 90% the confidence interval for $p_1 - p_2$ is

$$(\hat{p}_1 - \hat{p}_2 - c, \hat{p}_1 - \hat{p}_2 + c) = (0.01, 0.19).$$
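
As a check of Example 8, the interval for a difference of two proportions can be computed as follows. This is a sketch under stated assumptions: the function name `two_proportion_ci` is hypothetical and $z_a$ is obtained from `statistics.NormalDist` (about 1.645 for confidence 0.9) rather than from the table value 1.64.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(x1, n1, x2, n2, confidence):
    """Large-sample confidence interval for p1 - p2 from counts x1/n1 and x2/n2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    z_a = NormalDist().inv_cdf((1 + confidence) / 2)
    # Unlike the test, the interval does not pool the two sample proportions.
    c = z_a * sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - c, diff + c


# Example 8: 35 of 100 in Boulder, 50 of 200 in Colorado Springs, confidence 0.9.
low, high = two_proportion_ci(35, 100, 50, 200, 0.9)
print(round(low, 2), round(high, 2))
```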


Problems 


11. We are interested in testing whether people in country A are more favorable 
to mandatory retirement at age 70 than in country B. A random sample of 30 
people in country A gave the following result (1 is for, 0 is against) 

ODODODODOIOLIOLOLIOIOLILILLIOLOIOIILIO 
A random sample of 35 people in country B gave the following result 
11110111100011000001110011000100000 


(a) Test whether people in country A are more favorable to mandatory
retirement at age 70 than in country B.

(b) Find a 95% confidence interval for the difference in proportions between
countries A and B.


12. We want to test whether drug A is less effective than drug B. In two independent
random samples, drug A was given to 31 patients and 25 recovered. Drug B was
given to 42 patients and 36 recovered.


(a) Test the claim that drug A is more effective than drug B.
(b) Find a 90% confidence interval for the difference in recovery rates between
drugs A and B.


13. We want to compare two pesticides. In two independent random samples,
pesticide A killed 15 of 35 cockroaches and pesticide B killed 20 of 35
cockroaches.


(a) Perform a test to compare the two pesticides.
(b) Find a 95% confidence interval for the efficiency difference between A and
B.


14. The English statistician Karl Pearson once tossed a coin 24,000 times and
obtained 12,012 heads. The English mathematician John Kerrich tossed a coin
10,000 times and obtained 5,067 heads. Find a confidence interval at the level 0.99
for the difference of the probabilities of heads for the 2 coins.

15. We want to compare the support for initiative A in March and October. Two
independent random samples are taken in March and October.

In March a poll indicates that 104 out of 250 voters are in favor of initiative
A. In October another poll (independent of the first one) indicates that 140 out of
300 voters are in favor of initiative A.


(a) Is there a significant difference between the support in March and October?
(b) Find a 95% confidence interval for the difference of proportions of voters
in favor of initiative A in March and in October.


Chapter 11
Estimating and Testing Means


1 Testing a Mean 


The ideas in this chapter are the same as in the previous chapter. The only difference 
is in the computations. When we deal with a proportion there is only one unknown 
parameter, namely p. In this chapter we deal with a more general distribution. We 
will have two unknown parameters, the mean $\mu$ and the variance $\sigma^2$. We start with
an example. 


Example 1 A manufacturer of lamp bulbs claims that the mean lifetime of his lamp
bulbs is larger than 1000 h. The average lifetime in a sample of 200 bulbs is 1016 h
with a standard deviation of 102 h. Test the claim of the manufacturer.

Let $\mu$ be the true lifetime mean of a lamp bulb. The claim of the manufacturer is
that $\mu > 1000$. So this should be our alternative hypothesis. Therefore, the test is
going to be

$$H_0: \mu = 1000$$
$$H_a: \mu > 1000$$

To estimate $\mu$ we use the sample average

$$\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n).$$

In this example, $n = 200$, $X_1, \ldots, X_n$ are the observed lifetimes for each bulb in
this sample and $\bar{X} = 1016$.
To compute the P value we apply the Central Limit Theorem. That is,

$$\frac{\bar{X} - \mu}{S/\sqrt{n}}$$




can be approximated by a standard normal distribution provided $n$ is large enough
(usually $n \ge 25$ will do). Recall that $S^2$ estimates the variance $\sigma^2$ and is defined by

$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

In this example we are told that $S = 102$. Hence, the P value is,

$$P = P(\bar{X} > 1016 \mid \mu = 1000) = P\left(\frac{\bar{X} - 1000}{S/\sqrt{n}} > \frac{1016 - 1000}{S/\sqrt{n}}\right) = P(Z > 2.21) = 0.01.$$


So at any level larger than 0.01 we reject $H_0$. In particular at the standard level 0.05
we reject $H_0$. That is, there is statistical evidence supporting the manufacturer’s
claim. 

We now summarize the method. 

Assume we have a large random sample with average $\bar{X}$ and standard deviation
$S$. Let $\mu_0$ be a fixed number and

$$z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}.$$

• For the test
$$H_0: \mu = \mu_0$$
$$H_a: \mu > \mu_0$$
the P value is $P = P(Z > z)$.
• For the test
$$H_0: \mu = \mu_0$$
$$H_a: \mu < \mu_0$$
the P value is $P = P(Z < z)$.
• For the test
$$H_0: \mu = \mu_0$$
$$H_a: \mu \ne \mu_0$$
the P value is $P = 2P(Z < z)$ if $z < 0$. The P value is $P = 2P(Z > z)$ if $z > 0$.
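
The three cases of the summary can be collected in one small function. The Python sketch below is an illustration, not the book's method of computation: the name `mean_z_test` and the labels `"greater"`, `"less"` and `"two-sided"` are assumptions, and tail probabilities come from `statistics.NormalDist` instead of a normal table.

```python
from math import sqrt
from statistics import NormalDist


def mean_z_test(sample_mean, sample_sd, n, mu0, alternative):
    """Return (z, P value) for a large-sample test of H0: mu = mu0."""
    z = (sample_mean - mu0) / (sample_sd / sqrt(n))
    cdf = NormalDist().cdf(z)  # P(Z < z)
    if alternative == "greater":      # H_a: mu > mu0
        p_value = 1 - cdf
    elif alternative == "less":       # H_a: mu < mu0
        p_value = cdf
    elif alternative == "two-sided":  # H_a: mu != mu0
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, p_value


# Example 1: n = 200 bulbs, average 1016 h, standard deviation 102 h, mu0 = 1000.
z, p_value = mean_z_test(1016, 102, 200, 1000, "greater")
print(round(z, 2), round(p_value, 3))
```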




Example 2 A farmer is supposed to deliver chickens to a grocery store that weigh 3
pounds on average. The grocery store claims that the chickens are on average under
3 pounds. A random sample of 100 chickens has an average of 46 ounces and a
standard deviation of 5 ounces. Test the claim of the store.

Since 3 pounds is 48 ounces, the claim we want to test is $\mu < 48$. This should be
our alternative hypothesis. So we perform the test

$$H_0: \mu = 48$$
$$H_a: \mu < 48$$

We compute the P value. First,

$$z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{46 - 48}{5/\sqrt{100}} = -4.$$

Hence, $P = P(Z < -4)$. This P value is extremely small. So at any reasonable
level (1%, 5% or 10%) we should reject the null hypothesis. That is, there is strong 
statistical evidence to support the claim of the grocery store. 


Example 3 We want to test whether a certain medication has an effect on the 
blood pressure. A random sample of 100 male patients 35 to 44 years is given 
the medication. The average blood pressure in this group is 122 with a standard 
deviation of 10. For this age range the blood pressure in the population is known to 
be 120. 
Since we are told to just test any effect we should do a two-way test.

$$H_0: \mu = 120$$
$$H_a: \mu \ne 120$$

We compute $z$ first,

$$z = \frac{122 - 120}{10/\sqrt{100}} = 2.$$

Since $z > 0$ the P value is

$$P = 2P(Z > 2) = 2(0.5 - 0.4772) = 0.0456.$$


This P value is very close to 5%. We reject the null hypothesis at the 10% level. At 
the 10% level there is evidence that the medication has an effect on blood pressure. 
Note that at the 1% level we would not reject the null hypothesis. 


2 Confidence Interval for a Mean 


For a sample large enough we have the following formula for a confidence interval 
for a mean. 


• Let $X_1, \ldots, X_n$ be an i.i.d. sample of a distribution with mean $\mu$ and variance
$\sigma^2$. Let $\bar{X}$ and $S$ be the sample average and standard deviation. Let $a$ be in (0, 1)
and $Z$ be a standard normal. Let $z_a$ be such that

$$P(|Z| < z_a) = a.$$

Let

$$c = z_a \frac{S}{\sqrt{n}},$$

then $(\bar{X} - c, \bar{X} + c)$ is a confidence interval for $\mu$ with confidence $a$.
The formula above is a consequence of the Central Limit Theorem. 


Example 4 Assume that 500 lamp bulbs have been tested and the average lifetime 
for this sample has been 562 days with a standard deviation of 112 days. Give a 
confidence interval with confidence 90% for the mean lifetime of this brand of lamp 
bulb. 

Note that

$$P(|Z| < z_a) = 2P(0 < Z < z_a) = 0.9.$$

From the normal table we get $z_a = 1.64$. Since $n = 500$ and $S = 112$,

$$c = z_a \frac{S}{\sqrt{n}} \approx 8.$$

Using that the sample average is $\bar{X} = 562$, the confidence interval is

$$(562 - 8, 562 + 8) = (554, 570),$$


with confidence 90%. 


Example 5 Find a confidence interval for the mean in Example 4 with confidence 
0.95. 
The only thing that changes is $z_a$. For $a = 0.95$, $z_a = 1.96$. Hence,

$$c = z_a \frac{S}{\sqrt{n}} = 1.96 \frac{112}{\sqrt{500}} \approx 10.$$




So with confidence 0.95, the confidence interval is 
(562 — 10,562 + 10) = (552, 572) 


for the mean lifetime of a lamp bulb. 
Comparing Examples 4 and 5 we see that when we raise the confidence we lose
in precision. That is, if $a$ increases so does $c$.
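
The trade-off between confidence and precision seen in Examples 4 and 5 can be reproduced with a short computation. The following Python sketch is illustrative: the function name `mean_ci` is an assumption, and $z_a$ comes from `statistics.NormalDist` rather than from the normal table.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(sample_mean, sample_sd, n, confidence):
    """Large-sample confidence interval (X_bar - c, X_bar + c) for a mean."""
    z_a = NormalDist().inv_cdf((1 + confidence) / 2)
    c = z_a * sample_sd / sqrt(n)
    return sample_mean - c, sample_mean + c


# Lamp bulbs of Examples 4 and 5: n = 500, average 562 days, S = 112 days.
for a in (0.90, 0.95):
    low, high = mean_ci(562, 112, 500, a)
    print(a, round(low), round(high))
```

The half-width is about 8 days at confidence 0.90 and about 10 days at confidence 0.95.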


3 Testing Two Means 


We start with an example. 


Example 6 Test the claim that lamp bulbs from brand A last longer than lamp bulbs 
from brand B. A sample of 200 lamp bulbs from A has a sample average of 1052h 
and a standard deviation of 151. A sample of 100 lamps from B has a sample average 
of 980 h and a standard deviation of 102. 

Let $\mu_1$ and $\mu_2$ be, respectively, the mean lifetimes of the lamp bulbs from
manufacturers A and B. We want to test whether $\mu_1 > \mu_2$. This is our alternative
hypothesis. We perform the test

$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 > \mu_2$$

We rewrite the test as a one parameter test

$$H_0: \mu_1 - \mu_2 = 0$$
$$H_a: \mu_1 - \mu_2 > 0$$

Let $n_1$ and $n_2$ be the sample sizes from A and B, respectively. We denote the
sample averages from A and B by $\bar{X}_1$ and $\bar{X}_2$, respectively, and the sample standard
deviations by $S_1$ and $S_2$.

We compute the P value.

$$P = P(\bar{X}_1 - \bar{X}_2 > 72 \mid \mu_1 - \mu_2 = 0).$$

Assuming the sample sizes are large enough and that the two random samples are
independent we get by the Central Limit Theorem that

$$P \approx P\left(Z > \frac{72}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\right) = P(Z > 4.87).$$




This is an extremely small P value. At any reasonable level we reject $H_0$. There is
strong statistical evidence supporting the claim that lamp bulbs from A last longer 
than lamp bulbs from B. 

We now summarize the method for testing two means. 

Assume we have two large and independent random samples with sizes $n_1$ and
$n_2$, respectively. Denote the sample averages by $\bar{X}_1$ and $\bar{X}_2$. Denote the two sample
standard deviations by $S_1$ and $S_2$. Let

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}.$$

• For the test
$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 > \mu_2$$
the P value is $P = P(Z > z)$.
• For the test
$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 < \mu_2$$
the P value is $P = P(Z < z)$.
• For the test
$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 \ne \mu_2$$
the P value is $P = 2P(Z < z)$ if $z < 0$. The P value is $P = 2P(Z > z)$ if $z > 0$.
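
As an illustration of the summary, the statistic $z$ for two large independent samples can be computed as below. The function name `two_mean_z` is an assumption; the data are those of Example 6, and the tail probability comes from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for H0: mu1 = mu2 with two large independent samples."""
    return (mean1 - mean2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


# Example 6: brand A (n = 200, average 1052, S = 151), brand B (n = 100, average 980, S = 102).
z = two_mean_z(1052, 151, 200, 980, 102, 100)
p_value = 1 - NormalDist().cdf(z)  # alternative H_a: mu1 > mu2
print(round(z, 2), p_value)
```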


4 Two Means Confidence Interval 


For two large and independent samples we have the following formula for a 
confidence interval for the difference of two means. 


• Let $\bar{X}_1$ and $S_1$ be the first sample average and standard deviation. Let $\bar{X}_2$ and $S_2$
be the second sample average and standard deviation. Let $a$ be in (0, 1) and $Z$ be
a standard normal. Let $z_a$ be such that

$$P(|Z| < z_a) = a.$$

Let

$$c = z_a \sqrt{S_1^2/n_1 + S_2^2/n_2},$$

then $(\bar{X}_1 - \bar{X}_2 - c, \bar{X}_1 - \bar{X}_2 + c)$ is a confidence interval for $\mu_1 - \mu_2$ with
confidence (or at the level) $a$.


Example 7 In Example 6 we rejected the null hypothesis. In order to estimate how
much longer lamp bulbs from brand A last we may compute a confidence interval.
With 95% confidence we get the following confidence interval for $\mu_1 - \mu_2$:

$$(\bar{X}_1 - \bar{X}_2 - c, \bar{X}_1 - \bar{X}_2 + c)$$

where

$$c = z_a \sqrt{S_1^2/n_1 + S_2^2/n_2}.$$

For $a = 0.95$, $z_a = 1.96$ and $c \approx 29$. So the confidence interval for $\mu_1 - \mu_2$ with
0.95 confidence is (43, 101).
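
The value $c \approx 29$ can be verified with the following sketch. It is an illustration under stated assumptions: the function name `two_mean_ci` is hypothetical and $z_a$ is computed with `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_ci(mean1, sd1, n1, mean2, sd2, n2, confidence):
    """Large-sample confidence interval for mu1 - mu2 (independent samples)."""
    z_a = NormalDist().inv_cdf((1 + confidence) / 2)
    c = z_a * sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - c, diff + c


# Lamp bulb data of Example 6, confidence 0.95.
low, high = two_mean_ci(1052, 151, 200, 980, 102, 100, 0.95)
print(round(low), round(high))
```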


Problems 


1. Consider the following scores: 87, 92, 58, 64, 72, 43, 75. Compute the average
score $\bar{X}$ and the standard deviation $S$.

2. A researcher has measured the yields of 40 tomato plants and found
the sample average yield per plant to be 5 pounds with a sample standard deviation
of 1.7 pounds.


(a) Test whether the mean yield is above 4.5 pounds.
(b) Find a confidence interval for the mean yield at the level 0.9.


3. The heights of a sample of 25 six-year-old boys in population A average 85 cm with a
standard deviation of 5 cm. In population B it is known that the mean height
is $\mu = 90$ cm with a standard deviation $\sigma = 6$ cm.


(a) Test whether the heights in populations A and B differ.
(b) Find a confidence interval at the level 0.95 for the mean height of a six-year-old
boy in population A.


4. We would like to test whether a certain brand of radon detectors is under
measuring radon levels. We take a random sample of 25 detectors. Each detector
is exposed to 100 standard units of radon. The average reading is 97 with a
standard deviation of 8. Perform the test.


5. We would like to test whether Professor A is harsher than Professor B. The
same final exam is given to several sections of calculus. Each professor gets
to grade 50 papers taken at random from the pile. Professor A has an average
of 75 with a standard deviation of 12. Professor B has an average of 79 with a
standard deviation of 8. Perform the test.


6. A researcher wants to compare the yield of two varieties of tomatoes. The first
variety of 40 tomato plants has a sample average yield per plant of 5 pounds
with a sample standard deviation of 1.7 pounds. The second variety of 50 tomato
plants has a sample average yield per plant of 4.5 pounds with a sample standard
deviation of 1.2 pounds.


(a) Test the claim that variety 1 yields more than variety 2.
(b) Find a confidence interval for the difference in mean yield at the level 0.9.


7. We compare lamp bulbs from two different brands. The average lifetime for a
sample of brand A is $\bar{X}_1 = 562$ days; the standard deviation for the sample
is $S_1 = 112$ for $n_1 = 500$. For brand B, $n_2 = 300$ lamp bulbs have been
tested. The average lifetime for this sample is $\bar{X}_2 = 551$ days and the standard
deviation for the sample is $S_2 = 121$.


(a) Test whether the two brands have different mean lifetimes.
(b) Give a confidence interval at the level 95% for the difference of mean
lifetimes of the two brands of lamp bulb.


8. With pesticide A in a sample of 40 plants the average yield per plant is 5 pounds
with a sample standard deviation of 1.7 pounds. Using pesticide B in a sample
of 30 plants the average yield is 4.5 pounds with a standard deviation of 1.5
pounds. Compare the yields of the two treatments.


9. To study the effect of a drug on pulse rate the available subjects were divided at
random in two groups of 30 persons each. The first group was given the drug.
The second group was given a placebo. The treatment group had an average
pulse rate of 67 with a standard deviation of 8. The placebo group had an
average pulse rate of 71 with a standard deviation of 10. Test the effectiveness
of the drug.

10. Test the claim that girls are on average 10 points above boys in an aptitude test.
The aptitude test is given in 6th grade. A sample of 150 boys average 75 with
a standard deviation of 10. A sample of 160 girls average 87 with a standard
deviation of 8.

11. I want to test whether the random number generator on my computer is
consistent with a uniform distribution on [0, 1]. It generated 100 random
numbers with average 0.53 and standard deviation 0.09.

Perform a test to check whether this is consistent with a uniform distribution.


Chapter 12
Small Samples


1 Student Tests 


In the previous two chapters we have used the Central Limit Theorem to get 
confidence intervals and perform hypothesis testing. Usually the CLT may be safely 
applied for random samples of size 25 or larger. In this chapter, we will see 
alternatives for smaller sample sizes. 

If we have a random sample of size n from a normal population then it is possible 
to compute the exact distribution of the normalized sample average (recall that the 
CLT only gives an approximate distribution). We now state this result precisely. 


• Assume that $X_1, X_2, \ldots, X_n$ are observations from a random sample taken in
a normal population. Let $\mu$ be the true mean. Let $\bar{X}$ and $S$ be respectively the
sample average and the sample standard deviation. Then

$$\frac{\bar{X} - \mu}{S/\sqrt{n}}$$

follows a Student distribution with $n - 1$ degrees of freedom. A Student random
variable with $r$ degrees of freedom will be denoted by $t(r)$.


Student distributions are very similar to the standard normal distribution: they
are bell shaped and symmetric around the y axis. The only difference is that the
tails of the Student distribution are longer than the tails of the standard normal 
distribution. That is, the probability of a large observation is higher for a Student 
distribution than for a standard normal distribution. However, as the number of 
degrees of freedom increases Student distributions are closer and closer to the 
standard normal distribution. 




Example 1 Assume that the weights of 5 nine-year-old boys are 25, 28, 24, 26 and
24 kilograms in a certain population. Find a confidence interval for the mean
weight of nine-year-old boys in that population.

We first need to compute the sample average and standard deviation.

$$\bar{X} = \frac{2 \times 24 + 25 + 26 + 28}{5} = 25.4.$$

We have the following computational formula for $S^2$

$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} X_i^2 - \frac{n}{n - 1} (\bar{X})^2.$$

We compute the sum of the squares first

$$\sum_{i=1}^{n} X_i^2 = 3237$$

and we get

$$S^2 = \frac{1}{4} 3237 - \frac{5}{4} 25.4^2 = 2.8.$$

The formula for a confidence interval is very similar to the ones we have seen before.
For a confidence $a$, the confidence interval is $(\bar{X} - c, \bar{X} + c)$ with

$$c = t_a \frac{S}{\sqrt{n}}$$

where

$$P(|t(n - 1)| < t_a) = a.$$

In this example $n = 5$. If we pick $a = 0.9$ the Student table for 4 degrees of freedom
gives $t_a = 2.13$. Therefore, the confidence interval at the level 0.9 is

$$(\bar{X} - c, \bar{X} + c) = (25.4 - 1.6, 25.4 + 1.6) = (23.8, 27).$$


Note that if we had a large sample we would be using $z_a = 1.64$ instead of $t_a$
and the confidence interval would be narrower.
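
The arithmetic of this example can be reproduced with Python's `statistics` module, whose `stdev` function uses the same $n - 1$ denominator as $S$. This is only a sketch: the function name `student_ci` is an assumption, and since the standard library has no Student distribution, the value $t_a = 2.13$ is passed in as read from the Student table.

```python
from math import sqrt
from statistics import mean, stdev


def student_ci(data, t_a):
    """Small-sample confidence interval; t_a is read from a Student table."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation with the n - 1 denominator
    c = t_a * s / sqrt(n)
    return x_bar - c, x_bar + c


weights = [25, 28, 24, 26, 24]  # kilograms, from the example above
low, high = student_ci(weights, t_a=2.13)  # 4 degrees of freedom, a = 0.9
print(round(low, 1), round(high, 1))
```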




Example 2 A certain pain medication is said to provide more than 3h of relief. We 
would like to test this claim. The medication is given to 6 patients. The average 
relief time is 200 min and the standard deviation is 40 min. The test is 


$$H_0: \mu = 180$$
$$H_a: \mu > 180$$

Assuming that the sample is normal with mean $\mu = 180$

$$\frac{\bar{X} - 180}{S/\sqrt{n}}$$

follows a Student distribution with $n - 1 = 5$ degrees of freedom. Let

$$t = \frac{\bar{X} - 180}{S/\sqrt{n}} = \frac{200 - 180}{40/\sqrt{6}} = 1.22.$$

The P value is therefore $P = P(t(5) > 1.22)$. Since 1.16 < 1.22 < 1.48
we read in the Student table that the P value is between 0.1 and 0.15. Hence, at the 
5% level the null hypothesis is not rejected. There is not enough evidence to support 
the claim that the medication provides more than 3h of relief. 


2 Two Means Student Tests 


Example 3 A company claims that its new fertilizer works better than the old one. 
To test the claim 10 identical small plots are fertilized, 5 with the new fertilizer and 
5 with the old fertilizer. The average yield with the new fertilizer is 123 pounds of 
tomatoes and the standard deviation is 6 pounds. For the old fertilizer the average is 
116 pounds and the standard deviation is 7 pounds. 

The samples are too small to use the CLT. However, we can perform a Student
test. Let

$$S^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}.$$

Assuming that the yields are normally distributed and that the true variances are
equal we have that under $\mu_1 = \mu_2$

$$\frac{\bar{X}_1 - \bar{X}_2}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

follows a Student distribution with $n_1 + n_2 - 2$ degrees of freedom. We perform the
following test

$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 > \mu_2$$

Numerically, $\bar{X}_1 = 123$, $S_1 = 6$, $\bar{X}_2 = 116$, $S_2 = 7$, $n_1 = n_2 = 5$. Hence,

$$S^2 = \frac{4 \times 6^2 + 4 \times 7^2}{8} = 42.5.$$

Let

$$t = \frac{\bar{X}_1 - \bar{X}_2}{S \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = 1.7.$$

Since $n_1 + n_2 - 2 = 8$, the P value is

$$P = P(t(8) > 1.7).$$


According to the Student table the P value is between 5% and 10%. So at the 5% 
level we would not reject the null hypothesis. There is not enough evidence to claim 
that the new fertilizer yields more tomatoes. 

Note that we assumed that the variances were the same to compare the two 
means. When the variances are not the same a different Student test may be 
performed. See for instance 6.5 in ’Statistical Methods in Biology’ by N.T.J. Bailey, 
Cambridge University Press, Third Edition. 
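
For the equal-variance case treated in Example 3, the pooled variance and the Student statistic can be computed as in the sketch below. The function name `pooled_t` is an assumption; the code returns the statistic and the degrees of freedom, and the P value is then read from a Student table as in the text.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample Student statistic assuming equal true variances."""
    df = n1 + n2 - 2
    # Pooled estimate S^2 of the common variance.
    s_sq = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean1 - mean2) / (sqrt(s_sq) * sqrt(1 / n1 + 1 / n2))
    return t, df


# Example 3: new fertilizer (123, S = 6, n = 5), old fertilizer (116, S = 7, n = 5).
t, df = pooled_t(123, 6, 5, 116, 7, 5)
print(round(t, 2), df)
```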


3 Student Tests for Matched Pairs 


The two sample techniques we have seen so far apply to two samples that are 
independent of each other. In this section we are going to see a technique that applies 
to two samples that are not independent. 

Assume we want to test the effectiveness of a course on students. We test 
the students before and after a course to assess the effectiveness of the course. 
We should not analyze such data as two independent samples. We do not have 
independence since we are testing the same individuals in the two samples. We 
have a so called matched pairs sample instead. In such a case we should analyze the 
difference between the two tests for each individual. We then apply a one sample 
technique to the differences. Next we treat such an example. 




Example 4 Does a certain course help the students? 10 students are given two
similar tests before and after a course. Here are their grades and the difference
between the two tests.


Before          71  72  85  90  55  61  76  78  79  85
After           73  75  89  92  50  68  82  81  86  80
After-before     2   3   4   2  -5   7   6   3   7  -5


Let $\mu$ be the true gain after the course. We would like to test

$$H_0: \mu = 0$$
$$H_a: \mu > 0$$

The sample of differences has an average gain of $\bar{X} = 2.4$ and a standard deviation
of $S = 4.3$. Assuming that the gains are normally distributed we can use a Student
test. Let

$$t = \frac{\bar{X} - 0}{S/\sqrt{10}} = 1.75.$$

The P value is

$$P = P(t(9) > 1.75).$$


The Student table indicates that the P value is between 0.05 and 0.1. At the 5% level 
we cannot reject the null hypothesis. There is not enough evidence to claim that the 
course increases the test scores. 


• Note that we can use this matched pair technique for large samples as well. If the
sample size is large enough we just use the normal table.
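
The matched pairs computation of Example 4 amounts to a one sample computation on the differences, which the following Python sketch makes explicit. It is an illustration only; the ninth "before" grade is taken to be 79, a value inferred from the listed difference of 7 and the "after" grade of 86.

```python
from math import sqrt
from statistics import mean, stdev

# Grades of Example 4; the ninth "before" grade (79) is inferred from 86 - 7.
before = [71, 72, 85, 90, 55, 61, 76, 78, 79, 85]
after = [73, 75, 89, 92, 50, 68, 82, 81, 86, 80]

# Work with the per-student differences, not with two independent samples.
gains = [a - b for a, b in zip(after, before)]
n = len(gains)
x_bar = mean(gains)
s = stdev(gains)  # n - 1 denominator
t = x_bar / (s / sqrt(n))  # compare with a Student table, n - 1 = 9 degrees of freedom

print(gains)
print(round(x_bar, 1), round(s, 1), round(t, 2))
```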


4 The Sign Test 


For a sample which is not normal and is too small to use the CLT we still have
the following test at our disposal. The so called sign test may be performed
without assuming that the random variables follow a given distribution. We still
need a sample of $n$ i.i.d. random variables but nothing will be assumed about the
distribution of these random variables. This type of test is said to be nonparametric.
We will explain the sign test on an example.




Example 5 We would like to test the claim that the median height of a 6 years old 
boy in this population is at least 84 cm. Assume that the heights in cm of 11 six 
years old boys are the following: 80, 93, 85, 87, 79, 85, 85, 86, 89, 90 and 91. Let 
m be the true median of the population. We will perform the following test. 


$$H_0: m = 84$$
$$H_a: m > 84$$

Recall that the median of a random variable $X$ is such that

$$P(X > m) = P(X < m) = 1/2.$$

Let $B$ be the number of observations above 84. Under the null hypothesis $m = 84$,
$B$ is a binomial random variable with parameters $n = 11$ and $p = 1/2$. This is so
because there is the same chance for an observation to fall below or above 84. The
sign test is based on the random variable $B$. Since $H_a: m > 84$, we will reject the
null hypothesis if $B$ is large. In this sample we note that $B = 9$. We compute the P
value for the sign test

$$P = P(B \ge 9 \mid m = 84) = \binom{11}{9}(1/2)^{11} + \binom{11}{10}(1/2)^{11} + \binom{11}{11}(1/2)^{11}.$$


We get a P value of 0.03. Thus, at the 5% level we reject the null hypothesis. There 
is statistical evidence that the median height in this population is at least 84 cm. 


Example 6 Test the claim that the median weight loss for a certain diet is larger 
than 5 pounds. The diet is tested on 8 people. Here are the weights before and after 
the diet 


Before 181 178 205 195 202 176 180 177 
After 175 171 196 192 190 168 176 171 
Weight loss 6 7 9 3 12 8 4 6 


Let $m$ be the true median weight loss. We want to test

$$H_0: m = 5$$
$$H_a: m > 5$$

Let $B$ be the number of weight losses larger than 5. Under $m = 5$, $B$ follows a
binomial with parameters $n = 8$ and $p = 1/2$. In this sample $B = 6$. Thus, the P value is

$$P = P(B \ge 6 \mid m = 5) = \binom{8}{6}(1/2)^8 + \binom{8}{7}(1/2)^8 + \binom{8}{8}(1/2)^8 = 0.14.$$




At the level 5% or 10% we do not reject the null hypothesis. That is, there is not
enough evidence to claim that the median weight loss of the diet is larger than 5 pounds.


Example 7 Farm worker X claims that he picks more apples than farm worker Y. 
Here are the quantities picked by both workers over 7 days. 


Y 172 165 161 184 174 142 190 


We test

$H_0$: X and Y pick the same quantity.

$H_a$: X picks more than Y.

Let $B$ be the number of days that X outperforms Y. In this sample $B = 5$. Under
the null hypothesis $B$ is a binomial with parameters $n = 7$ and $p = 1/2$. The P
value is

$$P(B \ge 5 \mid H_0) = 0.23.$$


We do not reject the null hypothesis at the 5% level. There is not enough evidence 
to claim that X picks more apples than Y. 
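
The P values of the three sign tests above are upper tails of a binomial distribution with p = 1/2. A minimal sketch of that computation follows; the function name `sign_test_p_value` is ours, chosen for illustration, and the pairs (n, B) are the ones used in Examples 5, 6, and 7.

```python
from math import comb

def sign_test_p_value(n, b):
    """Upper-tail P value P(B >= b) for B ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# (n, observed B) for Examples 5, 6, and 7
for n, b in [(11, 9), (8, 6), (7, 5)]:
    print(n, b, round(sign_test_p_value(n, b), 2))
```

Rounded to two decimals, the three values agree with the P values 0.03, 0.14, and 0.23 quoted in the examples.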


Problems 


1. (a) What is P(t(3) > 2)? 
(b) Compare (a) to P(Z > 2). 

2. Consider the following sample of 20 observations. 1.6859 2.3622 0.1358 
2.4462 1.0006 0.1026 0.3264 0.7916 3.1584 3.3492 0.1715 3.5265 3.1504 
0.6643 1.6108 0.1530 0.5478 2.4738 1.5712 3.2063 


(a) Compute the sample average and the sample variance. 

(b) What percentage of the observations are within 1 standard deviation of the
average? 

(c) What percentage of the observations are within 2 standard deviations of the 
average? 

(d) Is it reasonable to assume that these observations come from a normal 
population? 


3. Some components in the blood tend to vary normally over time for each 
individual. Assume that the following levels for a given component were 
measured on a single patient: 5.5, 5.2, 4.5, 4.9, 5.6 and 6.3. 


(a) Test the claim that the mean level for this patient is above 4.7. 
(b) Find a confidence interval with 0.95 confidence for the mean level of this 
patient. 




4. Assume that a group of 10 eighth graders taken at random averaged 85 on a test 
with a standard deviation of 7. 


(a) Is there evidence that the true mean grade for this population is above 80? 
(b) Find a confidence interval for the true mean grade. 
(c) What assumptions did you make to answer (a) and (b)? 


5. A sample of 8 students was given a placement test. After a week of classes the
students were given a placement test again. Here are their scores.


Before 71 78 80 90 55 65 76 77 
After 75 71 89 92 61 68 80 81 


(a) Test whether the scores improved after one week by performing a Student 
test. 
(b) Test whether the scores improved after one week by performing a sign test. 


6. In an agricultural field trial, researchers tested two varieties of tomatoes in 10 
plots. In eight of the plots variety A yielded more than variety B. Is this enough 
evidence to say that variety A yields more than variety B? 

7. A diet was tested on 9 people. Here are their weights before and after the diet. 


Before 171 178 180 190 165 165 176 177 182 
After 175 171 182 161 168 156 165 171 175 


(a) Test whether the diet leads to a weight loss of at least 5 pounds by performing
a Student test.
(b) Perform a sign test for the hypothesis in (a). 


8. A test given to 12 male students has an average of 75 and standard deviation 
of 11. The same test given to 10 female students has an average of 81 with a 
standard deviation of 8. 


(a) Is there evidence that the female students outperform the male students? 
(b) Find a confidence interval for the difference of the true means. 


9. Does Calculus improve the algebra skills of the students? At the beginning 
of the semester 100 Calculus students were given an algebra test and got an 
average of 79 and a standard deviation of 15. At the end of the semester the 
same 100 students were given another algebra test for which the average was 
85 and the standard deviation was 12. The standard deviation for the differences 
between the two tests is 5. Perform a test. 

10. We would like to assess the effect of a medication on blood pressure. The
medication is given to 12 patients. This group has an average blood pressure 
of 131 and a standard deviation of 5. Another group of 10 patients is given a 
placebo and their average is 127 and the standard deviation is 4. Perform a test. 


Chapter 13 
Chi-Squared Tests 


1 Testing Independence 


In this section we will test whether two variables are independent. 



Example 1 Is there a relation between the level of education and smoking? Assume
that a random sample of 200 was taken with the following results:


Education       | Smoker | Non-smoker
8 years or less | 9      | 38
12 years        | 21     | 80
16 years        | 5      | 47


In this test the null hypothesis $H_0$ will be that education level and smoking are
independent. The alternative hypothesis $H_a$ is that there is an association between
the two variables. In order to decide whether to reject the null hypothesis we will 
compare the counts in our sample to the expected counts under the null hypothesis. 
We now explain how to compute the expected counts under the null hypothesis. 

The probability that someone in the sample has 8 years or less of education is 


$$\frac{9 + 38}{200} = \frac{47}{200}.$$


The probability that someone in the sample is a smoker is


$$\frac{9 + 21 + 5}{200} = \frac{35}{200}.$$


Recall that the events A and B are independent if and only if 
$$P(A \cap B) = P(A)P(B).$$


© Springer Nature Switzerland AG 2022 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_13 




Thus, under the assumption that level of education and smoking are independent
the probability that someone taken at random in the sample has 8 years or less of
education and smokes is


$$\frac{47}{200} \times \frac{35}{200}.$$


The expected number among 200 people who have 8 years or less of education and
smoke is therefore


$$200 \times \frac{47}{200} \times \frac{35}{200} = \frac{47 \times 35}{200}.$$


More generally, 


• The expected count in a cell under the independence assumption is given by


$$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{sample size}}.$$


We now go back to the data of Example 1 and compute the expected counts for
all the cells.

Expected Counts


Education       | Smoker | Non-smoker
8 years or less | 8.225  | 38.775
12 years        | 17.675 | 83.325
16 years        | 9.1    | 42.9


To compute the P value of this test we use 


$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}.$$


Intuitively, since $X^2$ measures how far away each count in the sample is from the
expected count under $H_0$, we should reject independence if $X^2$ is large enough.
More precisely,


• To test whether two variables are related we use the statistic $X^2$. Let r and c be
the number of rows and columns, respectively. The P value for this test is given
by


$$P = P(\chi^2 > X^2),$$


where $\chi^2$ follows approximately a Chi-squared distribution with (r − 1)(c − 1)
degrees of freedom.




The chi-squared approximation gets better as the sample size increases and
is more reliable if every cell has an expected count of 5 or more.
We now go back to the data of Example 1 to perform the test. We compute $X^2$.


$$X^2 = \frac{(9 - 8.225)^2}{8.225} + \frac{(38 - 38.775)^2}{38.775} + \frac{(21 - 17.675)^2}{17.675} + \frac{(80 - 83.325)^2}{83.325} + \frac{(5 - 9.1)^2}{9.1} + \frac{(47 - 42.9)^2}{42.9} = 3.09.$$


In this example, there are three rows so r = 3 and two columns so c = 2. Thus,
(r − 1)(c − 1) = 2. The P value for Example 1 is


$$P = P(\chi^2(2) > 3.09).$$


According to the chi-squared table the P value is larger than 0.1. At the 5% level
we do not reject $H_0$. There is not enough evidence to support the claim that there is
an association between education level and smoking. 
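
The computation for Example 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `chi2_independence` is ours, and instead of a table we use the fact that for 2 degrees of freedom the chi-squared tail probability has the closed form $e^{-x/2}$.

```python
from math import exp

def chi2_independence(table):
    """Return (expected counts, X^2, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # expected count = row total * column total / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(table, expected)
             for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(col_totals) - 1)
    return expected, x2, df

# Rows: 8 years or less, 12 years, 16 years. Columns: smoker, non-smoker.
observed = [[9, 38], [21, 80], [5, 47]]
expected, x2, df = chi2_independence(observed)

# Closed form of the chi-squared tail, valid for df = 2 only.
p_value = exp(-x2 / 2)
print(expected, round(x2, 2), df, round(p_value, 2))
```

The P value obtained this way is about 0.21, in line with the statement that it is larger than 0.1.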


2 Goodness of Fit Test 


We now turn to another important Chi-squared test. We start with an example. 


Example 2 Consider the following 25 observations: 0, 3, 1, 0, 1, 1, 1, 3, 4, 3, 2, 0, 
2, 0, 0, 0, 4, 2, 3, 4, 1, 6, 1, 4, 1. Could these observations come from a Poisson 
distribution? 

We summarize the data in the following table: 


Values          | 0 | 1 | 2 | 3 | 4 or more
Observed counts | 6 | 7 | 3 | 4 | 5


Recall that a Poisson distribution has only one parameter, which is also its mean. 
We use the sample average to estimate the mean. We get 


$$\bar{X} = \frac{47}{25} = 1.88.$$


Let N be a Poisson random variable with mean 1.88. Then, 


$$P(N = 0) = e^{-1.88} = 0.15$$


and therefore the expected number of 0's in 25 observations is $25 \times e^{-1.88} = 3.81$.
Likewise we have that


$$P(N = 1) = 1.88\, e^{-1.88} = 0.29.$$


The expected number of 1’s in 25 observations is 7.17. Similarly, the expected 
number of 2’s is 6.74 and the expected number of 3’s is 4.22. The probability that 
N is 4 or more is 


$$P(N \ge 4) = 0.12.$$


Thus, the expected number of observations larger than or equal to 4 is 3. 
This yields the following table for the expected counts: 


Values          | 0    | 1    | 2    | 3    | 4 or more
Expected counts | 3.81 | 7.17 | 6.74 | 4.22 | 3


As for the previous Chi-squared test we compare the expected and observed 
counts. As before, let 


$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}.$$


• Consider the null hypothesis $H_0$: The observations follow the distribution F
(such as the Poisson distribution). The alternative hypothesis $H_a$ is that the
observations do not follow the distribution F. Let r be the number of cells and d
be the number of parameters that are estimated for the distribution F. Then, the
P value is given by


$$P = P(\chi^2 > X^2),$$


where the random variable $\chi^2$ follows approximately a Chi-squared distribution
with r − 1 − d degrees of freedom.


We apply this method to Example 2.
The statistic $X^2$ is easily computed,


$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 4.68.$$


We have r = 5. We had to estimate one parameter (the mean of the Poisson
distribution) therefore d = 1. Since r − 1 − d = 5 − 1 − 1 = 3, the P value is


$$P = P(\chi^2(3) > 4.68).$$




The Chi-squared table indicates that the P value is larger than 0.1. We do not reject 
the null hypothesis at the 5% or 10% level. That is, these observations are consistent 
with a Poisson distribution. 
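
The goodness of fit computation for the Poisson case can be sketched as follows. This is a sketch under stated assumptions, not the method used to produce the numbers in the text: the helpers `chi2_sf` and `poisson_gof` are names we introduce, the tail probability is computed with a standard recurrence for integer degrees of freedom instead of a table, and the expected counts are not rounded. Because of that last point the statistic comes out close to, but not exactly equal to, the rounded value in the text.

```python
from math import erfc, exp, factorial, gamma, sqrt

def chi2_sf(x, df):
    """P(chi-squared with df degrees of freedom > x), for integer df >= 1."""
    # Start from df = 1 or df = 2, then step up by 2 with the recurrence
    # sf(k + 2) = sf(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    k = 2 - df % 2                       # 1 if df is odd, 2 if df is even
    tail = erfc(sqrt(x / 2)) if k == 1 else exp(-x / 2)
    while k < df:
        tail += (x / 2) ** (k / 2) * exp(-x / 2) / gamma(k / 2 + 1)
        k += 2
    return tail

def poisson_gof(counts, mean, n_estimated):
    """counts[k] is the observed count of value k; the last cell is a tail cell."""
    n = sum(counts)
    probs = [exp(-mean) * mean ** k / factorial(k) for k in range(len(counts) - 1)]
    probs.append(1 - sum(probs))         # probability of the last value or more
    expected = [n * p for p in probs]
    x2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    df = len(counts) - 1 - n_estimated   # r - 1 - d
    return expected, x2, df, chi2_sf(x2, df)

observed = [6, 7, 3, 4, 5]               # values 0, 1, 2, 3, 4 or more
# Mean estimated from the sample (47/25), so one estimated parameter.
print(poisson_gof(observed, mean=47 / 25, n_estimated=1))
```

Calling `poisson_gof(observed, mean=2.0, n_estimated=0)` instead tests a fully specified Poisson distribution with mean 2, for which no parameter is estimated.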


Example 3 The observations of Example 2 were in fact generated as Poisson 
observations with mean 2 by a random generator. We now test whether these 
observations are consistent with a Poisson distribution with mean 2. The null 
hypothesis is now: The observations follow a Poisson distribution with mean 2. 
The only difference with Example 2 is that we are not estimating a parameter anymore.
Hence, d = 0. We compute the expected counts using a mean equal to 2. 


Values          | 0    | 1    | 2    | 3    | 4 or more
Expected counts | 3.38 | 6.77 | 6.77 | 4.51 | 3.75


This time $X^2 = 4.62$ and r − d − 1 = 5 − 0 − 1 = 4. Thus,

$$P = P(\chi^2(4) > 4.62) > 0.1.$$


We do not reject the null hypothesis. These observations are consistent with a mean 
2 Poisson distribution. 
The following example deals with continuous distributions. 


Example 4 Are the following observations consistent with a normal distribution? 

66, 64, 59, 65, 81, 82, 64, 60, 78, 62 

65, 67, 67, 80, 63, 61, 62, 83, 78, 65 

66, 58, 74, 65, 80 

The sample average is 69 and the sample standard deviation is 8 (we are rounding
to the closest integer to simplify the subsequent computations). We will now try to 
fit the observations to a normal distribution with mean 69 and standard deviation 8. 

We first pick the number of cells. Keeping in mind that the Chi-squared 
approximation is best when there are at least 5 expected counts per cell we pick 
5 cells. Using the standard normal table we find the 20th, 40th, 60th, and 80th 
percentiles. We read in the standard normal table that 


P(Z < −0.84) = 0.2
P(Z < −0.25) = 0.4
P(Z < 0.25) = 0.6
P(Z < 0.84) = 0.8




Recall that if X is a normal random variable with mean 69 and standard deviation
8, then


$$\frac{X - 69}{8}$$


is a standard normal random variable. So, for instance, the 20th percentile of a
normal random variable with mean 69 and standard deviation 8 is


$$69 + 8(-0.84) = 62.28.$$

Likewise the 40th, 60th, and 80th percentiles of a normal random variable with mean
69 and standard deviation 8 are 67, 71, 76. We have rounded these percentiles to the
nearest integer. We now compare the observed and expected counts.


Values          | (−∞, 62] | (62, 67] | (67, 71) | [71, 76) | [76, ∞)
Observed counts | 6        | 11       | 0        | 1        | 7
Expected counts | 5        | 5        | 5        | 5        | 5


We compute the statistic 


$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = 16.4.$$


We had to estimate two parameters ($\mu$ and $\sigma$) so d = 2. Hence, $X^2$ is approximately
a chi-squared random variable with r − d − 1 = 5 − 2 − 1 = 2 degrees of freedom.
We get the P value


$$P(\chi^2(2) > 16.4) < 0.01.$$


So we reject the null hypothesis at the 1% level. These observations are not 
consistent with a normal distribution. 
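
The counting step in Example 4 is easy to get wrong by hand, so here is a minimal sketch of it. It assumes the rounded cut points 62, 67, 71, 76 and the interval conventions of the table above; the helper name `cell` is ours. The P value uses the closed form $e^{-x/2}$ of the chi-squared tail, which is valid for 2 degrees of freedom only.

```python
from math import exp

data = [66, 64, 59, 65, 81, 82, 64, 60, 78, 62,
        65, 67, 67, 80, 63, 61, 62, 83, 78, 65,
        66, 58, 74, 65, 80]

def cell(x):
    """Index of the cell (-inf,62], (62,67], (67,71), [71,76), [76,inf) containing x."""
    if x <= 62:
        return 0
    if x <= 67:
        return 1
    if x < 71:
        return 2
    if x < 76:
        return 3
    return 4

observed = [0] * 5
for x in data:
    observed[cell(x)] += 1

expected = len(data) / 5                 # each cell has probability 0.2
x2 = sum((o - expected) ** 2 / expected for o in observed)
df = 5 - 2 - 1                           # two estimated parameters
p_value = exp(-x2 / 2)                   # closed form, df = 2 only
print(observed, x2, df, p_value)
```

With these conventions the observed counts and the statistic match the table and the value 16.4 above.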

Observe that the goodness of fit test is especially useful when we reject the null 
hypothesis. In that case we conclude that the observations are unlikely to come 
from the distribution we are testing. On the other hand when we do not reject the 
null hypothesis we are simply saying that the observations are consistent with the 
distribution we are testing. There may be a number of other distributions for which 
this is true as well. 


Problems 


Problems 1 and 2 use data from the American Mathematical Society regarding new
doctorates in mathematics (Notices of the AMS January 1998). Types I, II, and III
are groups of mathematics departments as ranked by the AMS.


1. The following table gives the number of new PhDs in mathematics according to 
their area of concentration and the type of department that granted their degree: 


         | Algebra | Geometry | Probability and stat
Type I   | 21      | 28       | 9
Type II  | 10      | 7        | 4
Type III | 3       | 1        | 3


Is there an association between the type of the institution and the areas of 
concentration? 
2. The table below breaks down the number of graduates in 1997 according to 
their gender and area of concentration. 


Algebra | Geometry | Probability and stat 
Male 123 118 194 
Female | 37 23 98 


Is there an association between areas of concentration and gender? 
3. We would like to test whether marital status and gender are independent. A 
random sample gave the following results: 


Men | Women 
Never married 21 9 
Currently married | 20 39 
Previously married | 7 7 


Perform a test. 
4. We toss two coins 90 times and we count the number of heads. We get 


Number of heads | 0 1 2 
Counts 21 |42 | 27 


Are these results consistent with two fair coins? 




5. Test whether the following observations are consistent with a Poisson distribution:
1, 4, 2, 7, 4, 3, 0, 2, 5, 2, 3, 2, 1, 5, 5, 0, 3, 2, 2, 2, 2, 1, 4, 1, 2, 4.

6. Test whether the following observations are consistent with a standard normal
distribution:

1.70, 0.11, 0.14, 0.81, 2.19
−1.56, −0.67, 0.89, −1.24, 0.26
−0.05, 0.72, 0.29, −1.09, −0.43
−2.23, −1.68, 0.23, 1.17, −0.87
−0.28, 1.11, −0.43, −0.16, −0.07

7. Test whether the following observations are consistent with a uniform distribution
on [0,100]:
99, 53, 18, 21, 20, 53, 58, 4, 32, 51,
24, 51, 62, 98, 2, 48, 97, 64, 61, 18,
25, 57, 92, 72, 95

8. Let T be an exponential random variable with mean 2. That is, the density of T
is $f(t) = \frac{1}{2} e^{-t/2}$ for t > 0. Find the 20th percentile of T.

9. Test whether the following observations are consistent with an exponential
distribution:
13, 7, 14, 10, 12, 8, 8, 8, 10, 9,
8, 10, 5, 14, 13, 7, 11, 11, 10,
8, 10, 10, 13, 9, 10


Chapter 14
Design of Experiments


The statistical tests we have introduced in previous chapters rely on designing 
statistically meaningful experiments. In this short chapter we will describe ways 
to design such experiments and we will also describe common mistakes. 


1 Double Blind Design 


Assume we want to assess the effectiveness of a drug on a certain disease. The best 
way to do so is the so-called double blind placebo control experiment. Assume we 
have a group of patients sick with the disease. We randomly assign each patient 
to the treatment group (i.e., the group that will get the drug) or the control group 
(i.e., the group that will get the placebo). Neither the patient nor the medical staff 
knows which patient is getting what. This is why this is called a double blind design. 
The point of randomizing is to eliminate the influence of unknown variables. The 
point of placebo control is to blind the patients and the medical staff and therefore 
eliminate outside intervention from the experiment. 

Double blind design is the gold standard of experimental design but many times 
it cannot be applied. For instance, someone noticed that most professional hockey 
players in Canada were born in the first 4 months of the year. This conclusion 
was reached by an observational study and cannot be reached otherwise. To find 
an explanation for the bias toward the first 4 months of the year, see the first chapter 
in “Outliers” by Malcolm Gladwell. 


© Springer Nature Switzerland AG 2022 155 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_14 


2 Data Dredging 


Thanks to our computing capabilities it is easy to have huge data sets at one’s 
fingertips. It may be tempting to exhaustively test multiple hypotheses based on 
the data. Such practice is called data dredging (or data fishing) and frequently leads 
to false conclusions. Typically one looks for a P value of 0.05 or less. If enough
tests are performed, then about 5% of the tests for which the null hypothesis is true
will erroneously reject it.
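
The 5% figure can be illustrated with a small simulation. This is a sketch in a simplified setting that we assume for illustration only: each test is a two-sided z test of "mean = 0" on its own sample of standard normal observations (so the null hypothesis is true every time and the standard deviation is known to be 1). The function name and its default arguments are ours.

```python
import random
from statistics import NormalDist, fmean

def false_rejection_rate(n_tests=2000, n_obs=30, level=0.05, seed=1):
    """Fraction of two-sided z tests of 'mean = 0' that reject a true null."""
    rng = random.Random(seed)
    z = NormalDist()                     # standard normal
    rejections = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0, 1) for _ in range(n_obs)]   # null hypothesis holds
        stat = fmean(sample) * n_obs ** 0.5                # standard normal under H0
        p_value = 2 * (1 - z.cdf(abs(stat)))
        if p_value < level:
            rejections += 1
    return rejections / n_tests

print(false_rejection_rate())
```

The printed fraction should be close to 0.05: roughly one test in twenty rejects the null hypothesis even though there is nothing to find.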

To avoid drawing false conclusions, if we use a data set to formulate a hypothesis, 
then we should not use that same data set to test this hypothesis. We should instead 
collect fresh data to test that hypothesis. 


Problems 


1. I want to estimate the number of high school math courses a typical UCCS 
freshman has taken. To do my estimate I use the following methods. 


(a) I pick the first 100 students in alphabetical order in the incoming students 
list. I count their math courses. Is this acceptable? 

(b) I check the transcript of every student in my Calculus 1 class. Is this 
acceptable? 


2. In 1936 the presidential election was F. D. Roosevelt against A. Landon. The
Literary Digest magazine sampled 2.4 million individuals and predicted the 
victory of Landon by 57% to 43%. However, Roosevelt won by 62% to 38%! 
How can such a large sample be so wrong? 

3. In a poll it was found out that 19% of biology teachers believe that humans and 
dinosaurs lived at the same time. 


(a) What is the significance of this survey if it was sent to 20,000 teachers and 
there were 200 responses? 

(b) What if it was sent to 400 teachers picked at random and there were 200 
responses? 


4. I perform 100 independent statistical tests at the 5% level. What is the probability 
that I will draw at least one wrong conclusion? 

5. The vast majority of P values reported in the medical literature are very close to 
5%. Why is this suspicious? What type of problem does this reveal? 

6. In the early 1990s it was recommended that all men 50 years old or older undergo 
regular prostate cancer screening in the USA. In the UK on the other hand there 
was no such screening program. The 5 year survival rate for prostate cancer was 
40% in the UK and 90% in the USA. So prostate cancer screening saves lives. 
Not so fast! The mortality rates for prostate cancer were about the same in the 
USA and in the UK! What was going on? 




7. The two tables below list graduate admissions data for majors A through F at
U.C. Berkeley. The first table is for men, the second is for women.


Major | Number of applicants | Percent admitted
A     | 825                  | 62
B     | 560                  | 63
C     | 325                  | 37
D     | 417                  | 33
E     | 191                  | 28
F     | 373                  | 6

Major | Number of applicants | Percent admitted
A     | 108                  | 82
B     | 25                   | 68
C     | 593                  | 34
D     | 375                  | 35
E     | 393                  | 24
F     | 341                  | 7


(a) Compare the overall admission rates for men and women. Does it seem like 
there is sex bias? 

(b) Looking now at the admission rates for each major, argue that there is no sex
bias.

(c) Explain the apparent discrepancy between (a) and (b). 


8. Gilbert Welch in his book “Should I get tested for cancer?” discusses (among 
many other interesting statistical studies) several mammography studies. He 
mentions in particular a large trial in Edinburgh. There, 45,000 women cared
for by 84 medical practices were enrolled in the study. Each medical practice
was randomly assigned to one of two treatments. Either all women in the practice 
had regular mammograms or all the women in the practice had regular clinical 
exams. 


(a) Explain why this is a flawed design. 
(b) What would be a better design? 


Chapter 15
The Cumulative Distribution Function


1 Definition and Examples 


Cumulative distribution functions play a central role in probability theory. As we 
will see in this chapter, they can be used to create new probability distributions from 
old ones. Modeling different phenomena such as wind speed or the time until the next
earthquake requires different models. This is why it is critical to be able to create
new probability distributions. 

Cumulative distribution functions are especially useful when dealing with con- 
tinuous random variables. The following definition applies to any random variable, 
but we will only study the continuous case. 


• Let X be a random variable. The function


$$F(x) = P(X \le x)$$


is called the cumulative distribution function or just distribution function of
X.


Assume X is a continuous random variable with density f and support (a, b).
Then, the distribution function F of X is


$$F(x) = P(X \le x) = \int_a^x f(t)\,dt,$$


for x in (a, b). By the Fundamental Theorem of Calculus if f is continuous at x,
then F is differentiable at x and


$$F'(x) = f(x).$$


© Springer Nature Switzerland AG 2022 159 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_15 




Hence, the distribution function determines the density and therefore the distribution 
of a continuous random variable. 


Example 1 Let U be a uniform random variable on (0, 1). That is, the density of U is
f(u) = 1 for u in (0, 1) and f(u) = 0 elsewhere.

We now compute F.

If u ≤ 0, then F(u) = P(U ≤ u) = 0 (i.e., U is always positive). If u ≥ 1, then
F(u) = P(U ≤ u) = 1 (i.e., U is always smaller than 1). If 0 < u < 1, then


$$F(u) = \int_0^u f(x)\,dx = \int_0^u dx = u.$$


Summarizing the computations above we get


F(u) = 0 if u ≤ 0
F(u) = u if 0 < u < 1
F(u) = 1 if u ≥ 1

There are three features of the graph in Fig. 15.1 that are true for all distribution
functions. We now state these without proof. Let F be the distribution function of a
random variable X. Then, we have the following three properties:


• $\lim_{x \to -\infty} F(x) = 0$.
• F is an increasing function. That is, if $x_1 < x_2$ then $F(x_1) \le F(x_2)$.
• $\lim_{x \to +\infty} F(x) = 1$.


Example 2 Let T be an exponential random variable with rate $\lambda$. What is its
distribution function?
The density of T is $f(t) = \lambda e^{-\lambda t}$ for t > 0. Thus, F(t) = 0 for t ≤ 0. For t > 0,


$$F(t) = \int_0^t f(x)\,dx = \left[-e^{-\lambda x}\right]_0^t = 1 - e^{-\lambda t}.$$

Summarizing,

F(t) = 0 for t ≤ 0
$F(t) = 1 - e^{-\lambda t}$ for t > 0
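
The relation between a density and its distribution function can be checked numerically. The sketch below integrates the exponential density with the trapezoid rule and compares the result with the closed form of Example 2. The rate 0.5 and the function name `cdf_from_density` are assumptions made for illustration.

```python
from math import exp

def cdf_from_density(f, a, x, steps=10_000):
    """Approximate F(x), the integral of f from a to x, with the trapezoid rule."""
    h = (x - a) / steps
    total = 0.5 * (f(a) + f(x)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h

lam = 0.5                                # hypothetical rate, chosen for illustration

def density(t):
    """Exponential density with rate lam."""
    return lam * exp(-lam * t)

for t in (0.5, 1.0, 3.0):
    # numerical integral of the density versus the closed form 1 - exp(-lam * t)
    print(t, cdf_from_density(density, 0.0, t), 1 - exp(-lam * t))
```

The two columns should agree to several decimal places for every t.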


Fig. 15.1 This is the graph 
of the c.d.f. of a uniform 
random variable 


2 Transformations of Random Variables 


At this point we know relatively few different continuous distributions: uniform, 
exponential, and normal are the main distributions we have seen. In this section we 
will see a general method to obtain many more distributions from the known ones. 
We start with an example. 


Example 3 Let U be a uniform random variable on (0, 1). Define $X = U^2$. What is
the probability density of X?

Since the support of U is (0, 1) so is the support of $X = U^2$. Let $F_X$ and $F_U$ be
the distribution functions of X and U, respectively.


$$\begin{aligned} F_X(x) &= P(X \le x) \\ &= P(U^2 \le x) \\ &= P(U \le \sqrt{x}) \\ &= F_U(\sqrt{x}). \end{aligned}$$


Recall that if the probability density $f_X$ is continuous at x, then the distribution
function $F_X$ is differentiable at x and


$$\frac{d}{dx} F_X(x) = f_X(x).$$


Hence, for x in (0, 1)

$$f_X(x) = \frac{d}{dx} F_U(\sqrt{x}),$$

and therefore by the chain rule,

$$f_X(x) = \frac{1}{2\sqrt{x}} f_U(\sqrt{x}).$$


Since U is uniform on (0, 1), $f_U(u) = 1$ for u in (0, 1). Hence, $f_U(\sqrt{x}) = 1$ for x
in (0, 1). Thus,


$$f_X(x) = \frac{1}{2\sqrt{x}} \text{ for } x \in (0, 1).$$


• The preceding example gives a general method to compute the probability
density of the transformed random variable. We first find the support of the
transformed random variable. Then, we compute the distribution function of
the transformed random variable. Assuming the distribution function is regular
enough (which will always be the case for us) the probability density of the
transformed variable is the derivative of the distribution function.
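
The method can be checked on Example 3 with a short sketch. It compares the distribution function $\sqrt{x}$ of $X = U^2$ with simulated values, and the density $1/(2\sqrt{x})$ with a numerical derivative of the distribution function. The function names, the sample size, and the seed are choices made for illustration.

```python
import random

def F_X(x):
    """Distribution function of X = U^2 on (0, 1), as derived in Example 3."""
    return x ** 0.5

def density_from_cdf(F, x, h=1e-6):
    """Central-difference approximation of the derivative F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

rng = random.Random(0)
sample = [rng.random() ** 2 for _ in range(100_000)]   # simulated values of U^2

for x in (0.1, 0.25, 0.7):
    empirical = sum(s <= x for s in sample) / len(sample)
    # empirical P(X <= x), F_X(x), numerical derivative, density 1/(2 sqrt(x))
    print(x, empirical, F_X(x), density_from_cdf(F_X, x), 1 / (2 * x ** 0.5))
```

In each row the first two numbers should be close to each other, and so should the last two.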


Example 4 (The Chi-Squared Distribution) Let Z be a standard normal random
variable. What is the density of $Y = Z^2$?

The support of Z is (−∞, +∞). Hence, the support of $Y = Z^2$ is (0, +∞).

Let y > 0,


$$\begin{aligned} F_Y(y) &= P(Y \le y) \\ &= P(Z^2 \le y) \\ &= P(-\sqrt{y} \le Z \le \sqrt{y}). \end{aligned}$$

Recall that the standard normal is centered around 0. Hence,

$$P(-\sqrt{y} \le Z \le \sqrt{y}) = 2P(0 \le Z \le \sqrt{y}).$$


Since


$$P(Z \le \sqrt{y}) = \frac{1}{2} + P(0 \le Z \le \sqrt{y}),$$


$$F_Y(y) = 2\left(P(Z \le \sqrt{y}) - \frac{1}{2}\right) = 2F_Z(\sqrt{y}) - 1.$$


We now take the derivative with respect to y. By the chain rule,


$$f_Y(y) = \frac{1}{\sqrt{y}} f_Z(\sqrt{y}).$$

Since the density of Z is

$$f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2},$$


the probability density of Y is


$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, y^{-1/2} e^{-y/2} \text{ for } y > 0.$$




This is the density of the so-called Chi-Squared distribution with one degree of 
freedom. 
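
As a sanity check on Example 4, the following sketch differentiates $F_Y(y) = 2F_Z(\sqrt{y}) - 1$ numerically and compares the result with the density obtained above. It relies on `statistics.NormalDist` from the Python standard library for $F_Z$; the function names and the step size h are our choices.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

Z = NormalDist()                          # standard normal

def F_Y(y):
    """Distribution function of Y = Z^2: 2 * F_Z(sqrt(y)) - 1."""
    return 2 * Z.cdf(sqrt(y)) - 1

def f_Y(y):
    """Density of the chi-squared distribution with one degree of freedom."""
    return y ** -0.5 * exp(-y / 2) / sqrt(2 * pi)

h = 1e-6
for y in (0.5, 1.0, 4.0):
    slope = (F_Y(y + h) - F_Y(y - h)) / (2 * h)   # numerical derivative of F_Y
    print(y, slope, f_Y(y))
```

Note that $F_Y(1) = P(-1 \le Z \le 1)$, which is about 0.68.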


Example 5 Let T be exponentially distributed with mean 1. What is the distribution
of $X = \sqrt{T}$?
Since the support of T is (0, +∞), so is the support of X. Let x > 0,

$$\begin{aligned} F_X(x) &= P(X \le x) \\ &= P(\sqrt{T} \le x) \\ &= P(T \le x^2) \\ &= F_T(x^2). \end{aligned}$$

By the chain rule, the density of X is


$$f_X(x) = \frac{d}{dx} F_T(x^2) = 2x e^{-x^2} \text{ for } x > 0.$$
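
A quick simulation gives an independent check of Example 5. The sketch below draws values of T with `random.expovariate`, takes square roots, and compares the empirical distribution function with $1 - e^{-x^2}$. The sample size and the seed are arbitrary choices.

```python
import random
from math import exp, sqrt

rng = random.Random(0)
# T is exponential with mean 1 (rate 1) and X = sqrt(T)
sample = [sqrt(rng.expovariate(1.0)) for _ in range(100_000)]

def F_X(x):
    """Distribution function from Example 5: F_T(x^2) = 1 - exp(-x^2)."""
    return 1 - exp(-x * x)

for x in (0.5, 1.0, 1.5):
    empirical = sum(s <= x for s in sample) / len(sample)
    print(x, round(empirical, 3), round(F_X(x), 3))
```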


Next we finally prove a property of normal random variables that we have already 
used many times. 


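Example 6 Let X be a normal random variable with mean $\mu$ and standard deviation
$\sigma$. Show that $Y = \frac{X - \mu}{\sigma}$ is a standard normal random variable.

The support of X is (−∞, +∞) and therefore so is the support of Y. We compute
the distribution function of Y.


$$\begin{aligned} F_Y(y) &= P(Y \le y) \\ &= P\left(\frac{X - \mu}{\sigma} \le y\right) \\ &= P(X \le \mu + \sigma y) \\ &= F_X(\mu + \sigma y). \end{aligned}$$


Therefore,


$$f_Y(y) = \frac{d}{dy} F_X(\mu + \sigma y) = f_X(\mu + \sigma y) \times \sigma.$$


Recall that the density of X is


$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$


Thus,


$$f_Y(y) = f_X(\mu + \sigma y) \times \sigma = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}.$$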


This proves that Y is a standard normal random variable. 
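
The identity $F_Y(y) = F_X(\mu + \sigma y)$ used in this proof can be checked numerically with `statistics.NormalDist`. The values of $\mu$ and $\sigma$ below are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 5.0, 2.0                      # hypothetical parameters for illustration
X = NormalDist(mu, sigma)
Z = NormalDist()                          # standard normal

for y in (-1.5, 0.0, 0.7, 2.0):
    # F_Y(y) = P((X - mu) / sigma <= y) = F_X(mu + sigma * y)
    print(y, X.cdf(mu + sigma * y), Z.cdf(y))
```

The two columns should coincide up to rounding error, whatever values are chosen for the parameters.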


Problems 


1. Graph the distribution function of a uniform random variable on [−1, 2].
2. Let X be a random variable with distribution function $F(x) = x^2$ for x in (0, 1).


(a) What is P(X < 1/3)?
(b) What is the expected value of X? 


3. Consider a standard normal random variable Z. Use a normal table to sketch the 
graph of the distribution function of Z. 

4. Let U be a uniform random variable on (0, 1). Define $X = \sqrt{U}$. What is the
probability density of X? 

5. Let T be exponentially distributed with mean 1. What is the probability density
of $T^{1/3}$?

6. Let Z be a standard normal random variable. Find the probability density of $X = e^Z$.
(X is called a lognormal random variable).

7. Let U be uniform on (0, 1). Find the probability density of Y = ln(1 − U).

8. Let T be exponentially distributed with rate $\lambda$. Find the probability density of
$T^{1/a}$ where a > 0. ($T^{1/a}$ is called a Weibull random variable with parameters
a and $\lambda$).

9. Let X be a continuous random variable, let a > 0 and b be two real numbers,
and let Y = aX + b.


(a) Show that


$$f_Y(y) = \frac{1}{a} f_X\left(\frac{y - b}{a}\right).$$

(b) Show that if X is normally distributed, then so is Y. With what parameters? 

(c) If X is exponentially distributed, is Y = aX + b also exponentially 
distributed? 


3 Sample Maximum and Minimum 


Another important application of cumulative distribution functions is the computa- 
tion of the distribution of the maximum and minimum of a random sample. 


Example 7 Assume that $T_1$ and $T_2$ are two independent exponentially distributed
random variables with rates $\lambda_1$ and $\lambda_2$, respectively. Let T be the minimum of $T_1$
and $T_2$. What is the distribution of T?

First observe that since the support of $T_1$ and $T_2$ is (0, +∞) so is the support of
T. Let F be the distribution function of T. Let t > 0,


$$F(t) = P(T \le t) = P(\min(T_1, T_2) \le t).$$


In order to achieve $\min(T_1, T_2) \le t$ there are three possibilities: $T_1 \le t$ and
$T_2 > t$, or $T_1 > t$ and $T_2 \le t$, or $T_1 \le t$ and $T_2 \le t$. This is why it is better to
compute the probability of the complement. That is, $P(\min(T_1, T_2) > t)$. Observe
that $\min(T_1, T_2) > t$ if and only if $T_1 > t$ and $T_2 > t$. Thus, since we are assuming
that $T_1$ and $T_2$ are independent,


$$\begin{aligned} F(t) &= 1 - P(T_1 > t)P(T_2 > t) \\ &= 1 - (1 - F_1(t))(1 - F_2(t)), \end{aligned}$$


where $F_1$ and $F_2$ are the distribution functions of $T_1$ and $T_2$, respectively. Using the
distribution function given in Example 2,


$$F(t) = 1 - e^{-\lambda_1 t} e^{-\lambda_2 t} = 1 - e^{-(\lambda_1 + \lambda_2)t} \text{ for } t > 0.$$


Note that F is the distribution function of an exponential random variable with rate
$\lambda_1 + \lambda_2$. Therefore, the minimum of two independent exponential random variables
is also exponentially distributed and its rate is the sum of the two rates. 
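
A simulation makes this result concrete. The sketch below draws pairs of independent exponential values with `random.expovariate`, keeps the minimum of each pair, and compares the empirical distribution function with that of an exponential random variable whose rate is the sum of the two rates. The rates 0.5 and 1.5, the sample size, and the seed are assumptions chosen for illustration.

```python
import random
from math import exp

lam1, lam2 = 0.5, 1.5                     # hypothetical rates for illustration
rng = random.Random(0)
mins = [min(rng.expovariate(lam1), rng.expovariate(lam2)) for _ in range(100_000)]

for t in (0.2, 0.5, 1.0):
    empirical = sum(m <= t for m in mins) / len(mins)
    exact = 1 - exp(-(lam1 + lam2) * t)   # exponential with rate lam1 + lam2
    print(t, round(empirical, 3), round(exact, 3))

# The sample mean should be close to 1 / (lam1 + lam2).
print(sum(mins) / len(mins))
```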

Next we look at the maximum of three uniform random variables. 


Example 8 Let $U_1$, $U_2$, and $U_3$ be three independent random variables uniformly
distributed on [0,1]. Let M be the maximum of $U_1$, $U_2$, $U_3$. What is the density of
M?

The maximum of three numbers in (0, 1) is also in (0, 1). Hence, the support of
M is (0, 1). We first compute the distribution function of M and then differentiate
to get the density. Let $F_M$ be the distribution function of M. Then,


$$F_M(x) = P(M \le x) = P(\max(U_1, U_2, U_3) \le x).$$


Note that $\max(U_1, U_2, U_3) \le x$ if and only if $U_1 \le x$, $U_2 \le x$, and $U_3 \le x$. Thus,
due to the independence of the $U_i$,


$$F_M(x) = P(U_1 \le x)P(U_2 \le x)P(U_3 \le x).$$




Since $U_1$, $U_2$, and $U_3$ all have the same distribution they have the same distribution
function. Recall from Example 1 that P(U ≤ x) = x for 0 < x < 1. Hence,


$$F_M(x) = x^3 \text{ for } 0 < x < 1.$$


Thus, the density of M that we denote by $f_M$ is

$$f_M(x) = \frac{d}{dx} F_M(x) = 3x^2 \text{ for } x \text{ in } (0, 1).$$


Observe that the maximum of uniform random variables is not uniform! 
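
The same kind of check works for the maximum. The sketch below simulates the maximum of three independent uniform values and compares the empirical distribution function with $x^3$. It also uses the fact that the density $3x^2$ has mean $\int_0^1 3x^3\,dx = 3/4$. The sample size and the seed are arbitrary choices.

```python
import random

rng = random.Random(0)
maxima = [max(rng.random(), rng.random(), rng.random()) for _ in range(100_000)]

for x in (0.3, 0.5, 0.9):
    empirical = sum(m <= x for m in maxima) / len(maxima)
    print(x, round(empirical, 3), round(x ** 3, 3))   # F_M(x) = x^3

# The density 3x^2 has mean 3/4, which the sample mean should approximate.
print(sum(maxima) / len(maxima))
```

A uniform random variable on (0, 1) would have mean 1/2, so the sample mean near 3/4 is another way to see that the maximum is not uniform.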


Problems 


10. Assume that waiting times for buses 5 and 8 are exponentially distributed with 
means 10 and 20 min, respectively. I can take either bus so I will take the first 
bus that comes. 


(a) Compute the probability that I will have to wait at least 15 min. 
(b) What is the mean time I will have to wait? 


11. Consider a circuit with two components in parallel. Assume that both components
have independent exponential lifetimes with means 1 and 2 years,
respectively.


(a) What is the probability that the circuit lasts more than 3 years? 
(b) What is the expected lifetime of the circuit? 


12. Assume that $T_1$ and $T_2$ are two independent exponentially distributed random
variables with rates $\lambda_1$ and $\lambda_2$, respectively. Let M be the maximum of $T_1$ and
$T_2$. What is the density of M?

13. Let $U_1, U_2, \ldots, U_n$ be n i.i.d. uniform random variables on (0, 1).


(a) Find the probability density of the maximum of the $U_i$.
(b) Find the probability density of the minimum of the $U_i$.


14. Let $X_1, X_2, \ldots, X_n$ be independent random variables with distribution functions
$F_1, F_2, \ldots, F_n$, respectively. Let $F_{\max}$ and $F_{\min}$ be the distribution functions
of the random variables $\max(X_1, X_2, \ldots, X_n)$ and $\min(X_1, X_2, \ldots, X_n)$,
respectively.


(a) Show that $F_{\max} = F_1 F_2 \cdots F_n$.
(b) Show that $F_{\min} = 1 - (1 - F_1)(1 - F_2) \cdots (1 - F_n)$.


4 Simulations 


Consider a continuous random variable X. Assume that the distribution function $F_X$
of X is strictly increasing and continuous so that the inverse function $F_X^{-1}$ is well
defined. Let U be a uniform random variable on (0,1). Since $F_X$ is an increasing
function, $F_X^{-1}(U) \le x$ if and only if $F_X(F_X^{-1}(U)) \le F_X(x)$, that is, if and only if
$U \le F_X(x)$. Hence,


$$P(F_X^{-1}(U) \le x) = P(U \le F_X(x)).$$

Since $F_X(x)$ is always in [0, 1] (why?), by Example 1 above

$$P(U \le F_X(x)) = F_U(F_X(x)) = F_X(x).$$

Hence,

$$P(F_X^{-1}(U) \le x) = F_X(x).$$


That is, the distribution function of $F_X^{-1}(U)$ is the same as the distribution function
of X. This shows the following.


• Let X be a continuous random variable with a strictly increasing distribution
function $F_X$. Let U be a uniform random variable on (0, 1). Then $F_X^{-1}(U)$ has
the same distribution as X. That is, to simulate X it is enough to simulate a
uniform random variable U and then compute $F_X^{-1}(U)$.


Random generators are computer programs that simulate uniform random vari- 
ables. The remark above gives a recipe to go from a uniform to any distribution. 


Example 9 A random generator gives us the following 10 random numbers, 
0.38, 0.1, 0.6, 0.89, 0.96, 0.89, 0.01, 0.41, 0.86, 0.13. 
Simulate 10 independent exponential random variables with rate 1. 
By Example 2 we know that the distribution function $F$ of an exponential random
variable with rate 1 is

$$F(x) = 1 - e^{-x}.$$

We compute $F^{-1}$. If $y = 1 - e^{-x}$, then

$$x = -\ln(1 - y).$$

Thus, $F^{-1}(x) = -\ln(1 - x)$. We now compute $F^{-1}(x)$ for


x = 0.38, 0.1, 0.6, 0.89, 0.96, 0.89, 0.01, 0.41, 0.86, 0.13.




We get the following ten observations for ten independent exponential rate 1 random
variables (each value is $-\ln(1 - x)$ rounded to two decimals): 0.48, 0.11, 0.92, 2.21, 3.22, 2.21, 0.01, 0.53, 1.97, 0.14.
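
The inverse-transform recipe of Example 9 can be sketched in a few lines of Python. This is only an illustration: the function name `exponential_inverse_cdf` and the optional `rate` argument are our own choices, not notation from the text.

```python
import math

def exponential_inverse_cdf(u, rate=1.0):
    """Inverse of F(x) = 1 - exp(-rate * x); expects 0 <= u < 1."""
    return -math.log(1.0 - u) / rate

# The ten uniform draws quoted in Example 9.
uniforms = [0.38, 0.1, 0.6, 0.89, 0.96, 0.89, 0.01, 0.41, 0.86, 0.13]

# Apply F^{-1}(u) = -ln(1 - u) to each draw (rate 1).
observations = [exponential_inverse_cdf(u) for u in uniforms]

for u, x in zip(uniforms, observations):
    print(f"u = {u:4.2f}  ->  x = {x:5.2f}")
```

Replacing the hard-coded list by calls to a uniform random generator (for instance `random.random()`) gives as many exponential observations as desired.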


Example 10 How do we simulate a standard normal distribution? In this case the
distribution function is

$$F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$

This is not an expression that is easy to use. Instead we use the normal table. For
instance, if we want $F^{-1}(0.38)$ we are looking for $z$ such that


$$P(Z < z) = 0.38.$$


Hence, $P(0 < Z < -z) = 0.12$. We read in the table $-z = 0.31$, that is, $z = -0.31$.
Using the 10 random numbers from Example 9 we get the following ten
observations for a standard normal distribution: -0.31, -1.28, 0.25, 1.22, 1.75, 1.22,
-2.33, -0.23, 1.08, -1.13.


Example 11 Simulate a normal distribution $X$ with mean $\mu = 5$ and variance
$\sigma^2 = 4$. We know that if $Z$ is a standard normal distribution, then $\mu + \sigma Z$ is
a normal distribution with mean $\mu$ and variance $\sigma^2$. We can use the simulation
of $Z$ in Example 10 to get simulations of $X$. For instance, if $Z = -0.31$, then
$X = 5 + (-0.31) \times 2 = 4.38$. Here are the 10 observations for a normal distribution
$X$ with mean $\mu = 5$ and variance $\sigma^2 = 4$: 4.38, 2.44, 5.5, 7.44, 8.5, 7.44,
0.34, 4.54, 7.16, 2.74.
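
The table lookups in the two normal examples above can be reproduced with Python's standard library, which provides a numerical inverse of the normal distribution function. The sketch below is an illustration under that assumption; the variable names are ours, and the computed values differ slightly from the table values because the table is rounded to two decimals.

```python
from statistics import NormalDist

# The ten uniform draws used in the simulation examples.
uniforms = [0.38, 0.1, 0.6, 0.89, 0.96, 0.89, 0.01, 0.41, 0.86, 0.13]

# Standard normal: inv_cdf plays the role of reading F^{-1} from the table.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
z_values = [standard_normal.inv_cdf(u) for u in uniforms]

# Normal with mean 5 and variance 4 (standard deviation 2): X = mu + sigma * Z.
mu, sigma = 5.0, 2.0
x_values = [mu + sigma * z for z in z_values]

for u, z, x in zip(uniforms, z_values, x_values):
    print(f"u = {u:4.2f}   z = {z:6.2f}   x = {x:5.2f}")
```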


Problems 


15. Using the random numbers 0.8147 0.9058 0.1270 0.9134 0.6324 0.0975 0.2785 
0.5469 0.9575 0.9649, simulate 10 observations of a normal distribution with 
mean 3 and standard deviation 2. 

16. Using the random numbers 0.8147 0.9058 0.1270 0.9134 0.6324 0.0975 0.2785 
0.5469 0.9575 0.9649, simulate 10 observations of an exponential distribution 
with mean 2. 


Chapter 16
Continuous Joint Distributions


1 Joint and Marginal Densities 


To compute a probability involving two random variables $T_1$ and $T_2$ we need the
joint distribution of $(T_1, T_2)$. In this chapter we will consider only continuous
random variables. The joint distribution will be given by the joint density of the
random variables.


• Let $X$ and $Y$ be two continuous random variables. The joint density of the vector
$(X, Y)$ is a positive function $f$ such that

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\,dx\,dy = 1.$$

For $a < b$ and $c < d$,

$$P(a < X < b \text{ and } c < Y < d) = \int_a^b \int_c^d f(x, y)\,dy\,dx.$$

More generally, for a function $g$,

$$E(g(X, Y)) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} g(x, y) f(x, y)\,dx\,dy$$

provided $g \ge 0$ or $\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} |g(x, y)| f(x, y)\,dx\,dy < +\infty$.
• Note that the joint density $f$ is defined everywhere on $\mathbb{R}^2$. It is however non-zero
only on its support. It will always be important to keep track of the support.


Example 1 Assume that $(X, Y)$ is uniformly distributed on the disc

$$C = \{(x, y) : x^2 + y^2 \le 1\}.$$

What is the joint density of $(X, Y)$?


Since the distribution is uniform there is a constant $c$ such that $f(x, y) = c$ for
$(x, y)$ in $C$ and $f(x, y) = 0$ elsewhere. That is, the support of $f$ is $C$. Hence,

$$\iint_C f(x, y)\,dx\,dy = 1 = c \times \text{area}(C).$$

Since the area of $C$ is $\pi$, we get $c = 1/\pi$.


Note that if the random vector $(X, Y)$ has joint density $f$, then for any $a < b$,

$$P(a < X < b) = P(a < X < b \text{ and } -\infty < Y < +\infty) = \int_a^b \int_{-\infty}^{+\infty} f(x, y)\,dy\,dx.$$

Let

$$f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy,$$

then

$$P(a < X < b) = \int_a^b f_X(x)\,dx.$$

That is, $f_X$ is the density of $X$. We now state this result.


• Let $(X, Y)$ be a random vector with density $f$. Then, the densities of $X$ and $Y$
are denoted, respectively, by $f_X$ and $f_Y$ and are called the marginal densities.
They are given by

$$f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy \quad\text{and}\quad f_Y(y) = \int_{-\infty}^{+\infty} f(x, y)\,dx.$$


Example 2 We consider again the uniform random vector on the unit disc from
Example 1. What are the marginal densities of $X$ and $Y$?
Since $x^2 + y^2 \le 1$, if we fix $x$ in $[-1, 1]$, then $y$ varies between $-\sqrt{1 - x^2}$ and
$+\sqrt{1 - x^2}$. Thus,

$$f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\,dy.$$

Therefore,

$$f_X(x) = \frac{2}{\pi}\sqrt{1 - x^2} \quad\text{for } x \text{ in } [-1, 1].$$

By symmetry we also have

$$f_Y(y) = \frac{2}{\pi}\sqrt{1 - y^2} \quad\text{for } y \text{ in } [-1, 1].$$


Note that although the vector (X, Y) is uniform the random variables X and Y are 
not uniform. 
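
As a sanity check on the marginal density of Example 2, one can compare a probability computed from $f_X$ with a Monte Carlo estimate. The sketch below is only an illustration: the helper names, the seed, the sample size and the interval $[-0.5, 0.5]$ are arbitrary choices of ours.

```python
import math
import random

def sample_unit_disc(rng):
    """Rejection sampling: draw from the square [-1, 1]^2 until the point is in the disc."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y

def marginal_probability(a, b):
    """Integral of (2/pi) * sqrt(1 - x^2) over [a, b], for -1 <= a <= b <= 1."""
    def antiderivative(x):
        return (x * math.sqrt(1.0 - x * x) + math.asin(x)) / math.pi
    return antiderivative(b) - antiderivative(a)

rng = random.Random(2022)  # fixed seed so the run is reproducible
n = 100_000
hits = sum(1 for _ in range(n) if -0.5 <= sample_unit_disc(rng)[0] <= 0.5)

empirical = hits / n
exact = marginal_probability(-0.5, 0.5)
print(f"empirical P(-0.5 <= X <= 0.5) = {empirical:.4f}, from f_X = {exact:.4f}")
```

A uniform random variable on $[-1, 1]$ would give probability 0.5 to this interval, so the larger value obtained here is another way to see that $X$ is not uniform.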


2 Independence 


Recall that two random variables $X$ and $Y$ are said to be independent if for any
$a < b$ and $c < d$,

$$P(a < X < b \text{ and } c < Y < d) = P(a < X < b)\,P(c < Y < d).$$

This definition translates nicely into a property of densities that we now state without
proof (the proof is beyond the mathematical level of this text).


• Let $(X, Y)$ be a random vector with joint density $f$ and marginal densities $f_X$
and $f_Y$. The random variables $X$ and $Y$ are independent if and only if for all $x$
and $y$,

$$f(x, y) = f_X(x) f_Y(y).$$


Example 3 We continue to analyze the uniform distribution on a disc from Example 1.
Are $X$ and $Y$ independent?

Recall that in this case we have $f(x, y) = 1/\pi$ on $C = \{(x, y) : x^2 + y^2 \le 1\}$
and 0 elsewhere. We computed $f_X$ and $f_Y$ in Example 2 and clearly

$$f(x, y) \ne f_X(x) f_Y(y).$$


We conclude that X and Y are not independent. This is really not surprising. If I 
know for instance that Y is close to 1 then X must be close to 0. Hence, information 
on Y gives me information on X. 


Example 4 Consider two electronic components that have independent exponential
lifetimes with means 1 and 2 years, respectively. What is the probability that
component 1 outlasts component 2?


Let $T$ and $S$ be, respectively, the lifetimes of components 1 and 2. We want to
compute $P(T > S)$. We need the joint distribution of $(T, S)$. Since the two random
variables are assumed to be independent the joint density is

$$f(t, s) = f_T(t) f_S(s) = \frac{1}{2} e^{-t} e^{-s/2} \quad\text{for } t > 0,\ s > 0.$$

Note that the support of $(T, S)$ is all $(t, s)$ with $t > 0$ and $s > 0$. We now compute

$$P(T > S) = \frac{1}{2}\int_{s=0}^{\infty}\int_{t=s}^{\infty} e^{-t} e^{-s/2}\,dt\,ds.$$

We first integrate in $t$ and then in $s$ to get

$$P(T > S) = \frac{1}{2}\int_{s=0}^{\infty} e^{-s} e^{-s/2}\,ds = \frac{1}{3}.$$
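
The value 1/3 obtained in Example 4 can be checked by simulation. The following sketch is an illustration only; the seed and the sample size are arbitrary, and `random.expovariate` is parametrized by the rate, so a mean of 2 years corresponds to rate 1/2.

```python
import random

rng = random.Random(16)  # fixed seed so the run is reproducible
n = 100_000

wins = 0
for _ in range(n):
    t = rng.expovariate(1.0)  # component 1: mean 1 year, rate 1
    s = rng.expovariate(0.5)  # component 2: mean 2 years, rate 1/2
    if t > s:
        wins += 1

estimate = wins / n
print(f"estimated P(T > S) = {estimate:.4f}, exact value = {1 / 3:.4f}")
```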


Example 5 Assume that my arrival time at the bus stop is uniformly distributed 
between 7:00 and 7:05. Assume that the arrival time of the bus I take is uniformly 
distributed between 7:02 and 7:04. What is the probability that I catch the bus? 

To simplify the notation we measure time in minutes after 7:00. Let $U$ be my arrival time; it is
uniformly distributed on [0, 5]. Let $V$ be the arrival time of the bus; it is uniformly
distributed on [2, 4]. I catch the bus if and only if $U < V$. It is reasonable to assume
that $U$ and $V$ are independent. Hence, the joint density of $(U, V)$ is

$$f(u, v) = \frac{1}{5} \times \frac{1}{2} \quad\text{for } 0 < u < 5 \text{ and } 2 < v < 4.$$

Thus,

$$P(U < V) = \int_{v=2}^{4}\int_{u=0}^{v} \frac{1}{5} \times \frac{1}{2}\,du\,dv = \int_{v=2}^{4} \frac{v}{10}\,dv = \frac{3}{5}.$$


3 Transformations of Random Vectors 


A consequence of multivariate calculus is the following formula for the density of a 
transformed random vector: 


• Let $(X, Y)$ be a random vector with density $f$. Let $(U, V)$ be such that

$$U = g_1(X, Y) \quad\text{and}\quad V = g_2(X, Y).$$

Assume that the transformation $(x, y) \to (g_1(x, y), g_2(x, y))$ is one to one
with inverse

$$X = h_1(U, V) \quad\text{and}\quad Y = h_2(U, V).$$

Then the density of the transformed random vector $(U, V)$ is

$$f(h_1(u, v), h_2(u, v))\,|J(u, v)|$$

where the Jacobian $J(u, v)$ is the following determinant:

$$J(u, v) = \begin{vmatrix} \partial h_1/\partial u & \partial h_1/\partial v \\ \partial h_2/\partial u & \partial h_2/\partial v \end{vmatrix}.$$


We now use the preceding formula on an example. 


Example 6 Let $X$ and $Y$ be two independent standard normal distributions. Let $U = X/Y$
and $V = X$. What is the joint density of $(U, V)$?

Let $u = x/y$ and $v = x$. Then, solving in $x$ and $y$ we get $x = v$ and $y = v/u$.
Hence, $(x, y) \to (u, v)$ is a one to one transformation from $\mathbb{R} \times \mathbb{R}^*$ to $\mathbb{R}^* \times \mathbb{R}$
where $\mathbb{R}^*$ is the set of all real numbers except 0. Therefore, the support of $(U, V)$ is
all $(u, v)$ where $u \ne 0$.

We now compute the Jacobian

$$J(u, v) = \begin{vmatrix} 0 & 1 \\ -v/u^2 & 1/u \end{vmatrix} = v/u^2.$$

Since we assume that $X$ and $Y$ are independent standard normal distributions,

$$f(x, y) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,\frac{1}{\sqrt{2\pi}} e^{-y^2/2}.$$

Therefore, the joint density of $(U, V)$ is for all $u \ne 0$ and all $v$,

$$f(u, v) = \frac{1}{2\pi} e^{-v^2/2} e^{-v^2/(2u^2)}\,|J(u, v)| = \frac{1}{2\pi} \exp\left(-\frac{v^2}{2}\left(1 + \frac{1}{u^2}\right)\right) \frac{|v|}{u^2}.$$

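We now use this joint density to get the marginal density of $U$. We integrate the
density above in $v$ to get

$$f_U(u) = \int_{-\infty}^{+\infty} \frac{1}{2\pi} \exp\left(-\frac{v^2}{2}\left(1 + \frac{1}{u^2}\right)\right) \frac{|v|}{u^2}\,dv.$$

Observe that the integrand above is an even function of $v$. Thus,

$$f_U(u) = 2\int_0^{+\infty} \frac{1}{2\pi} \exp\left(-\frac{v^2}{2}\left(1 + \frac{1}{u^2}\right)\right) \frac{v}{u^2}\,dv.$$

Since

$$\frac{d}{dv} \exp\left(-\frac{v^2}{2}\left(1 + \frac{1}{u^2}\right)\right) = -v\left(1 + \frac{1}{u^2}\right) \exp\left(-\frac{v^2}{2}\left(1 + \frac{1}{u^2}\right)\right),$$

we get

$$f_U(u) = -\frac{1}{\pi}\,\frac{1}{1 + u^2} \left[\exp\left(-\frac{v^2}{2}\left(1 + \frac{1}{u^2}\right)\right)\right]_{v=0}^{+\infty}.$$

Hence,

$$f_U(u) = \frac{1}{\pi}\,\frac{1}{1 + u^2}.$$

Therefore, the ratio of two standard normal random variables follows the density
above which is called the Cauchy density. Note that $E(U)$ does not exist (see the
Problems). The computation above was done for $u \ne 0$. The Cauchy density is
actually defined for all $u$.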
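
The claim that the ratio of two independent standard normal random variables has the Cauchy density can be checked numerically. The sketch below compares the empirical distribution function of simulated ratios with the Cauchy distribution function $1/2 + \arctan(u)/\pi$, which is obtained by integrating the density above. The seed, the sample size and the helper names are our own illustrative choices.

```python
import math
import random

def cauchy_cdf(u):
    """Integral of 1 / (pi * (1 + t^2)) from -infinity to u."""
    return 0.5 + math.atan(u) / math.pi

rng = random.Random(6)  # fixed seed so the run is reproducible
n = 100_000

ratios = []
while len(ratios) < n:
    x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    if y != 0.0:  # guard against a (practically impossible) division by zero
        ratios.append(x / y)

def empirical_cdf(u):
    """Fraction of simulated ratios that are at most u."""
    return sum(1 for r in ratios if r <= u) / n

for u in (-2.0, 0.0, 1.0):
    print(f"u = {u:5.1f}: empirical {empirical_cdf(u):.4f}, Cauchy {cauchy_cdf(u):.4f}")
```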


Example 7 Assume that X and Y are independent exponential random variables 
with rates a and b, respectively. Find the density of X/Y. 

We could set $U = X/Y$ and $V = X$, find the density of $(U, V)$ and then find the
density of $U$. However, in this case since exponential functions are easy to integrate
we may use the distribution function technique. Let $U = X/Y$. Note that the support
of $U$ is $(0, +\infty)$. The distribution function of $U$ is defined for $u > 0$ by

$$F_U(u) = P(U \le u) = P(X/Y \le u).$$

Since $X$ and $Y$ are independent the joint density of $(X, Y)$ is $a e^{-ax}\, b e^{-by}$ for
$x > 0$ and $y > 0$. Thus, by integrating in $y$ first,

$$F_U(u) = P(Y \ge X/u) = \int_{x=0}^{\infty}\int_{y=x/u}^{\infty} a e^{-ax}\, b e^{-by}\,dy\,dx = \int_{x=0}^{\infty} a e^{-ax} e^{-bx/u}\,dx$$

$$= \left[-\frac{a}{a + b/u}\, e^{-(a + b/u)x}\right]_0^{+\infty} = \frac{a}{a + b/u}.$$

We now differentiate $F_U$ to get the probability density of $U$,

$$f_U(u) = \frac{ab}{(au + b)^2} \quad\text{for } u > 0.$$
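
The distribution function technique of Example 7 gives $F_U(u) = a/(a + b/u) = au/(au + b)$. The sketch below checks this formula by simulation and checks numerically that its derivative is the density $ab/(au + b)^2$. The rates $a = 2$, $b = 3$, the point $u_0 = 1.5$, the seed and the function names are arbitrary choices made for illustration.

```python
import random

def ratio_cdf(u, a, b):
    """F_U(u) = a*u / (a*u + b) for u > 0, where U = X/Y with rates a and b."""
    return a * u / (a * u + b)

def ratio_density(u, a, b):
    """f_U(u) = a*b / (a*u + b)^2 for u > 0."""
    return a * b / (a * u + b) ** 2

a, b, u0 = 2.0, 3.0, 1.5
rng = random.Random(7)  # fixed seed so the run is reproducible
n = 100_000

# X/Y <= u0 is the same event as X <= u0 * Y (Y is positive), which avoids dividing.
count = sum(1 for _ in range(n) if rng.expovariate(a) <= u0 * rng.expovariate(b))
empirical = count / n

# Central finite difference of the distribution function at u0.
h = 1e-6
slope = (ratio_cdf(u0 + h, a, b) - ratio_cdf(u0 - h, a, b)) / (2 * h)

print(f"empirical F_U({u0}) = {empirical:.4f}, formula = {ratio_cdf(u0, a, b):.4f}")
print(f"numerical derivative = {slope:.6f}, density = {ratio_density(u0, a, b):.6f}")
```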
Example 8 Let $Z_1$ and $Z_2$ be two independent standard normal random variables.
Let $X = Z_1 + Z_2$ and $Y = Z_1 - Z_2$. Show that $X$ and $Y$ are independent.

We first compute the joint density of $(X, Y)$. Let

$$x = z_1 + z_2, \qquad y = z_1 - z_2.$$

Hence,

$$z_1 = \frac{1}{2}(x + y), \qquad z_2 = \frac{1}{2}(x - y).$$

The transformation $(z_1, z_2) \to (x, y)$ is therefore one to one. That is, for any $(x, y)$
in $\mathbb{R}^2$ we can find a unique preimage $(z_1, z_2)$. The support of $(X, Y)$ is therefore $\mathbb{R}^2$.

The Jacobian is easily computed and is $-1/2$. Since $Z_1$ and $Z_2$ are independent
the joint density of $(Z_1, Z_2)$ is for all $(z_1, z_2)$ in $\mathbb{R}^2$,

$$f(z_1, z_2) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(z_1^2 + z_2^2)\right).$$

Observe that

$$z_1^2 + z_2^2 = \frac{1}{4}(x + y)^2 + \frac{1}{4}(x - y)^2 = \frac{1}{2}(x^2 + y^2).$$

Therefore, the joint density of $(X, Y)$ is for all $(x, y)$ in $\mathbb{R}^2$,

$$f(x, y) = \frac{1}{4\pi} \exp\left(-\frac{1}{4}(x^2 + y^2)\right).$$




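Note that for all $x$ and $y$,

$$f(x, y) = g(x) h(y),$$

where

$$g(x) = \frac{1}{2\sqrt{\pi}} \exp\left(-\frac{1}{4}x^2\right) \quad\text{and}\quad h(y) = \frac{1}{2\sqrt{\pi}} \exp\left(-\frac{1}{4}y^2\right).$$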


Using the following result, this is enough to show that X and Y are independent: 


• Let $f(x, y)$ be the joint density of $(X, Y)$. If there exist functions $g$ and $h$ such
that for all $x$ and $y$,

$$f(x, y) = g(x) h(y),$$

then $X$ and $Y$ are independent.


For a proof of this result see the problems. 


Problems 


1. Consider a uniform random vector on the triangle
$\{(x, y) : 0 \le x \le y \le 1\}$.


(a) Find the joint density of the vector $(X, Y)$.
(b) Find the marginal densities $f_X$ and $f_Y$.
(c) Are the two random variables $X$ and $Y$ independent?


2. Consider a uniform random vector on the square
$\{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$.


(a) Find the joint density of the vector $(X, Y)$.
(b) Find the marginal densities $f_X$ and $f_Y$.
(c) Are the two random variables $X$ and $Y$ independent?


3. Let $X$ and $Y$ be two independent exponential random variables with rates $\lambda$ and
$\mu$, respectively. What is the probability that $X$ is less than $Y$?


4. Let $X$ and $Y$ be two independent exponential random variables with rate $\lambda$.


(a) Find the joint density of $(X + Y, X/Y)$.
(b) Find the density of $X/Y$.
(c) Show that $X + Y$ and $X/Y$ are independent.


5. Let $X$ and $Y$ be two independent exponential random variables with rate 1. Let
$U = X/(X + Y)$.


(a) Find the distribution function of $U$.
(b) Find the density of $U$.


6. Let $X$ and $Y$ be two independent uniform random variables.


(a) Find the density of $XY$.
(b) Find the density of $X/Y$.


7. Let $X$ and $Y$ be two independent exponential random variables with rate $a$.
Let $U = X$ and $V = X + Y$. Find the joint density of $(U, V)$.


8. Let $X$ and $Y$ be two independent exponential random variables with rate $a$.
Let $U = X + Y$ and $V = X - Y$. Find the joint density of $(U, V)$.


9. Let $T_1, T_2, \dots, T_n$ be independent exponential random variables with rates $a_1$,
$a_2, \dots, a_n$, respectively. Let $S = \min(T_1, T_2, \dots, T_n)$.


(a) Show that $S$ is exponentially distributed with rate $a_1 + a_2 + \cdots + a_n$.
(b) Prove that $S = T_k$ with probability

$$\frac{a_k}{a_1 + a_2 + \cdots + a_n}.$$


10. Let $X$ and $Y$ be two independent exponential random variables with rates
$a$ and $b$, respectively. Let $U = \min(X, Y)$ and $V = \max(X, Y)$.


(a) Show that the joint density of $(U, V)$ is

$$f(u, v) = a e^{-au}\, b e^{-bv} + a e^{-av}\, b e^{-bu} \quad\text{for } 0 < u < v.$$

(b) Are $U$ and $V$ independent?


11. Let $T = U + V$ where $U$ and $V$ are two independent uniform random variables
on (0, 1). Let $h$ be the density of $T$.


(a) Show that the support of $T$ is (0, 2).
(b) Show that

$$h(t) = t \text{ for } t \text{ in } (0, 1), \qquad h(t) = 2 - t \text{ for } t \text{ in } (1, 2).$$

(c) Is $T$ uniformly distributed?


12. Consider $U$ with a Cauchy density $f_U(u) = \dfrac{1}{\pi}\,\dfrac{1}{1 + u^2}$ for all $u$.


(a) Show that fy is indeed a probability density. 
(b) Show that E(U) does not exist. 


13. Two friends have set an appointment between 8:00 and 8:30. Assume that 
the arrival times of the two friends are independent and uniformly distributed 
between 8:00 and 8:30. Assume also that the first that arrives waits for 15 min 
and then leaves. What is the probability that the friends miss each other? 

14. Let f(x, y) be the joint density of (X, Y). Assume there exist functions g and 
h such that for all x and y, 


f(x, y) = g(x)h(y). 


Let

$$a = \int_{-\infty}^{+\infty} h(y)\,dy \quad\text{and}\quad b = \int_{-\infty}^{+\infty} g(x)\,dx.$$


(a) Show that the marginal densities of $X$ and $Y$ are $f_X(x) = a g(x)$ and
$f_Y(y) = b h(y)$.
(b) Show that for all $x$ and $y$, $f(x, y) = f_X(x) f_Y(y)$.
(c) Show that X and Y are independent. 


4 Gamma and Beta Random Variables 


We start by defining the function Gamma. 


4.1 The Function Gamma 


• The function Gamma is defined for all $r > 0$ by

$$\Gamma(r) = \int_0^{\infty} x^{r-1} e^{-x}\,dx.$$


The integral that defines the function Gamma is improper at 0 if $r < 1$ and
is improper at $+\infty$ for any $r$. As $x$ approaches 0, $x^{r-1} e^{-x} \sim x^{r-1}$. Note that
$\int_0^1 x^{r-1}\,dx$ converges for all $r > 0$. Hence, the integral defining Gamma converges
near 0 for any $r > 0$.


We now turn to the integral near infinity. Note that as $x$ approaches infinity,

$$x^{r-1} e^{-x} \le e^{-x/2}.$$

To check this one can show that the ratio

$$\frac{x^{r-1} e^{-x}}{e^{-x/2}}$$

converges to 0 as $x$ goes to infinity. Since $x^{r-1} e^{-x}$ is a positive function and is less
than $e^{-x/2}$ for $x$ large enough, and the integral of $e^{-x/2}$ converges at $+\infty$, we conclude
that the improper integral

$$\int_1^{\infty} x^{r-1} e^{-x}\,dx$$

converges. This proves that the function Gamma is indeed defined for all $r > 0$.


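• For all real numbers $r > 0$,

$$\Gamma(r + 1) = r\,\Gamma(r).$$


The formula is easily proved by integration by parts,

$$\Gamma(r + 1) = \int_0^{\infty} x^{r} e^{-x}\,dx = \left[-x^{r} e^{-x}\right]_0^{\infty} + r\int_0^{\infty} x^{r-1} e^{-x}\,dx = r\,\Gamma(r),$$

since the boundary term vanishes at both 0 and $+\infty$.
As the next result shows the function Gamma can be seen as a generalization of
factorials to positive real numbers.
• For all natural numbers $n$, $\Gamma(n) = (n - 1)!$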


The proof of this result is an easy induction and is left to the reader. 
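
The two properties of the function Gamma stated above can be checked numerically with `math.gamma` from Python's standard library. The sketch below also approximates the defining integral with a simple midpoint rule. The test values of $r$, the truncation point 50 and the number of steps are arbitrary choices made for illustration.

```python
import math

def gamma_by_midpoint_rule(r, upper=50.0, steps=100_000):
    """Approximate the integral of x^(r-1) * exp(-x) over [0, upper] (meant for r >= 1)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h  # midpoint of the k-th subinterval
        total += x ** (r - 1.0) * math.exp(-x)
    return total * h

# Gamma(r + 1) = r * Gamma(r) for a few positive real numbers r.
recursion_ok = all(
    math.isclose(math.gamma(r + 1.0), r * math.gamma(r), rel_tol=1e-9)
    for r in (0.5, 1.0, 2.3, 7.25)
)

# Gamma(n) = (n - 1)! for natural numbers n.
factorial_ok = all(
    math.isclose(math.gamma(n), math.factorial(n - 1), rel_tol=1e-9)
    for n in range(1, 15)
)

approx = gamma_by_midpoint_rule(3.5)
print(recursion_ok, factorial_ok)
print(f"midpoint rule: {approx:.6f}, math.gamma(3.5): {math.gamma(3.5):.6f}")
```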


4.2 Gamma Random Variables 


• A random variable $X$ is said to have a Gamma distribution with parameters
$r > 0$ and $\lambda > 0$ if its density is

$$f(x) = \frac{\lambda^r}{\Gamma(r)}\, e^{-\lambda x} x^{r-1} \quad\text{for } x > 0.$$


Fig. 16.1 This is the probability density function of a Gamma r.v. with $\lambda = 2$, $r = 3$


See Fig. 16.1 for a sketch of the graph of the probability density of a Gamma
random variable with $\lambda = 2$ and $r = 3$.

Note that a Gamma with parameters $r = 1$ and $\lambda$ is an exponential random
variable with rate $\lambda$.


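• A Gamma random variable $X$ with parameters $r > 0$ and $\lambda > 0$ has

$$E(X) = \frac{r}{\lambda} \quad\text{and}\quad \mathrm{Var}(X) = \frac{r}{\lambda^2}.$$


We now compute the expected value. The computation of the variance is similar
and is left to the reader.

$$E(X) = \frac{\lambda^r}{\Gamma(r)} \int_0^{\infty} x^{r} e^{-\lambda x}\,dx.$$

By doing the change of variable $y = \lambda x$,

$$E(X) = \frac{\lambda^r}{\Gamma(r)} \int_0^{\infty} \frac{y^r}{\lambda^r}\, e^{-y}\,\frac{dy}{\lambda} = \frac{1}{\lambda\,\Gamma(r)} \int_0^{\infty} y^r e^{-y}\,dy.$$

By the definition of the function Gamma, the integral above is $\Gamma(r + 1)$. Hence,

$$E(X) = \frac{1}{\lambda\,\Gamma(r)}\,\Gamma(r + 1) = \frac{r}{\lambda},$$

where we used that $\Gamma(r + 1) = r\,\Gamma(r)$.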
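
The formulas for the mean and the variance of a Gamma random variable can be checked by simulation. The sketch below uses `random.gammavariate`, which is parametrized by a shape and a scale; with the notation of this section the shape is $r$ and the scale is $1/\lambda$. The parameter values, seed and sample size are arbitrary choices made for illustration.

```python
import random

r, lam = 3.0, 2.0        # parameters of the Gamma distribution (shape r, rate lam)
rng = random.Random(43)  # fixed seed so the run is reproducible
n = 100_000

# random.gammavariate(alpha, beta) uses beta as a scale, so beta = 1 / lam.
samples = [rng.gammavariate(r, 1.0 / lam) for _ in range(n)]

sample_mean = sum(samples) / n
sample_var = sum((x - sample_mean) ** 2 for x in samples) / n

print(f"sample mean = {sample_mean:.4f}, r / lam = {r / lam:.4f}")
print(f"sample variance = {sample_var:.4f}, r / lam^2 = {r / lam ** 2:.4f}")
```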


4.3 The Ratio of Two Gamma Random Variables 


The following result turns out to be useful in several computations. 


• Assume that $X$ and $Y$ are independent Gamma random variables with parameters
$(r, \lambda)$ and $(s, \lambda)$, respectively. Let $U = X/Y$. Then the probability density of $U$
is given by

$$f_U(u) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\frac{u^{r-1}}{(u + 1)^{r+s}} \quad\text{for } u > 0.$$


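Let $U = X/Y$ and $V = X$. We see that $(x, y) \to (u, v)$ is a one to one
transformation from $(0, \infty) \times (0, \infty)$ to itself and the Jacobian is $v/u^2$. The density
of $(X, Y)$ is

$$\frac{\lambda^r}{\Gamma(r)}\, x^{r-1} \exp(-\lambda x)\, \frac{\lambda^s}{\Gamma(s)}\, y^{s-1} \exp(-\lambda y) \quad\text{for } x > 0,\ y > 0.$$

Hence, the density of $(U, V)$ is

$$\frac{\lambda^{r+s}}{\Gamma(r)\Gamma(s)}\,\frac{v}{u^2}\, v^{r-1} \exp(-\lambda v) \left(\frac{v}{u}\right)^{s-1} \exp\left(-\lambda \frac{v}{u}\right) \quad\text{for } u > 0,\ v > 0.$$

Our goal is to compute the marginal density of $U$ and hence to integrate the
preceding joint density with respect to $v$. This is why we rearrange the joint density,

$$f(u, v) = \frac{\lambda^{r+s}}{u^{s+1}\,\Gamma(r)\Gamma(s)}\, v^{r+s-1} \exp\left(-\lambda\left(1 + \frac{1}{u}\right)v\right).$$

Now, for any $\mu > 0$ and $a > 0$,

$$\int_0^{\infty} \frac{\mu^a}{\Gamma(a)}\, x^{a-1} \exp(-\mu x)\,dx = 1.$$

This is so because the integrand is a Gamma density with parameters $a$ and $\mu$.
Hence,

$$\int_0^{\infty} x^{a-1} \exp(-\mu x)\,dx = \frac{\Gamma(a)}{\mu^a}.$$

We use this formula with $a = r + s$ and $\mu = \lambda(1 + 1/u)$ to get

$$\int_0^{\infty} v^{r+s-1} \exp\left(-\lambda\left(1 + \frac{1}{u}\right)v\right)dv = \frac{\Gamma(r + s)}{\lambda^{r+s}\left(1 + \frac{1}{u}\right)^{r+s}}.$$

Therefore,

$$\int_0^{\infty} f(u, v)\,dv = \frac{\lambda^{r+s}}{u^{s+1}\,\Gamma(r)\Gamma(s)}\,\frac{\Gamma(r + s)}{\lambda^{r+s}\left(1 + \frac{1}{u}\right)^{r+s}}.$$

Hence, the density of $U = X/Y$ is

$$f_U(u) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\frac{u^{r-1}}{(u + 1)^{r+s}} \quad\text{for } u > 0.$$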


4.4 Beta Random Variables 


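• A random variable $X$ is said to be a Beta random variable with parameters
$r > 0$ and $s > 0$ if it has density

$$f(x) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, x^{r-1} (1 - x)^{s-1} \quad\text{for } x \text{ in } (0, 1).$$

See Fig. 16.2 for the graph of a Beta probability density with parameters $r = 2$
and $s = 3$.
Moreover,

$$E(X) = \frac{r}{r + s} \quad\text{and}\quad \mathrm{Var}(X) = \frac{rs}{(r + s)^2 (r + s + 1)}.$$


We now compute the expected value. The variance will be left to the problems.
Let $X$ be a Beta distribution with parameters $(r, s)$. Then,

$$E(X) = \int_0^1 x f(x)\,dx = \int_0^1 \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, x^{r} (1 - x)^{s-1}\,dx.$$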

Fig. 16.2 This is the probability density of a Beta r.v. with $r = 2$ and $s = 3$


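Note that for any $a > 0$ and $b > 0$,

$$\int_0^1 \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}\,dx = 1,$$

since the integrand is the density of a Beta distribution with parameters $(a, b)$.
Hence,

$$\int_0^1 x^{a-1} (1 - x)^{b-1}\,dx = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}.$$

We apply this formula with $a = r + 1$ and $b = s$ to get

$$\int_0^1 x^{r} (1 - x)^{s-1}\,dx = \frac{\Gamma(r + 1)\Gamma(s)}{\Gamma(r + s + 1)}.$$

Thus,

$$E(X) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)} \int_0^1 x^{r} (1 - x)^{s-1}\,dx = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\frac{\Gamma(r + 1)\Gamma(s)}{\Gamma(r + s + 1)} = \frac{r}{r + s},$$

where we used that $\Gamma(r + 1) = r\,\Gamma(r)$ and $\Gamma(r + s + 1) = (r + s)\,\Gamma(r + s)$.
Next we state a relation between Gamma and Beta random variables.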
Next we state a relation between Gamma and Beta random variables. 




• Assume that $X$ and $Y$ are independent Gamma random variables with parameters
$(r, \lambda)$ and $(s, \lambda)$, respectively. Let

$$B = \frac{X}{X + Y}.$$

Then, $B$ is a Beta random variable with parameters $r$ and $s$.


We now compute the probability density of $B = \frac{X}{X + Y}$.

Since $X$ and $Y$ are positive random variables the support of $B$ is (0, 1). We will
get the density of $B$ through its distribution function $F_B$. Let $t$ be in (0, 1). Note that
$B$ can be written as

$$B = \frac{U}{U + 1},$$

where $U = X/Y$. Thus, $B \le t$ is equivalent to

$$U \le \frac{t}{1 - t}.$$

Using this observation,

$$F_B(t) = P(B \le t) = P\left(U \le \frac{t}{1 - t}\right).$$

By the chain rule,

$$\frac{d}{dt} F_B(t) = f_U\left(\frac{t}{1 - t}\right) \frac{1}{(1 - t)^2},$$

where $f_U$ is the density of $U$, which was computed in 4.3. Hence, the density $f_B$ of
$B$ is

$$f_B(t) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\,\frac{u^{r-1}}{(u + 1)^{r+s}} \times \frac{1}{(1 - t)^2},$$

where

$$u = \frac{t}{1 - t}.$$

After a little algebra we get

$$f_B(t) = \frac{\Gamma(r + s)}{\Gamma(r)\Gamma(s)}\, t^{r-1} (1 - t)^{s-1} \quad\text{for } t \text{ in } (0, 1).$$

This is the density of a Beta distribution with parameters $(r, s)$.
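
The relation between Gamma and Beta random variables suggests a simple way to simulate a Beta random variable from two Gamma random variables. The sketch below checks the resulting sample mean and variance against $r/(r + s)$ and $rs/((r + s)^2(r + s + 1))$. The parameter values, the seed and the sample size are arbitrary choices made for illustration; recall that `random.gammavariate` takes a shape and a scale $1/\lambda$.

```python
import random

r, s, lam = 2.0, 3.0, 1.5  # shapes r and s, common rate lam
rng = random.Random(44)    # fixed seed so the run is reproducible
n = 100_000

samples = []
for _ in range(n):
    x = rng.gammavariate(r, 1.0 / lam)  # Gamma with parameters (r, lam)
    y = rng.gammavariate(s, 1.0 / lam)  # Gamma with parameters (s, lam)
    samples.append(x / (x + y))         # B = X / (X + Y)

sample_mean = sum(samples) / n
sample_var = sum((b - sample_mean) ** 2 for b in samples) / n

exact_mean = r / (r + s)
exact_var = r * s / ((r + s) ** 2 * (r + s + 1.0))

print(f"mean: simulated {sample_mean:.4f}, Beta formula {exact_mean:.4f}")
print(f"variance: simulated {sample_var:.4f}, Beta formula {exact_var:.4f}")
```

Note that the common rate `lam` does not appear in the Beta parameters, so changing it should leave the two simulated values essentially unchanged.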


Problems


15. Show that for every natural number $n$, $\Gamma(n) = (n - 1)!$
16. Show that the variance of a Gamma random variable with parameters $r$ and $\lambda$ is

$$\mathrm{Var}(X) = \frac{r}{\lambda^2}.$$


17. Show that $f(x) = \frac{\lambda^r}{\Gamma(r)}\, e^{-\lambda x} x^{r-1}$ for $x > 0$ is indeed a probability density.


18. A random variable with density

$$f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2} \quad\text{for } x > 0$$

is said to be a Chi-squared random variable with $n$ degrees of freedom ($n \ge 1$
is an integer).

(a) Show that a Chi-squared random variable is also a Gamma random variable.
With what parameters?

(b) Find the expected value and the variance of a Chi-squared random variable with
$n$ degrees of freedom.

19. Show that a Beta random variable with parameters $r$ and $s$ has

$$\mathrm{Var}(X) = \frac{rs}{(r + s)^2 (r + s + 1)}.$$


20. Show that a uniform random variable on (0, 1) is also a Beta random variable.
With what parameters?

21. Let $X$ and $Y$ be independent and exponentially distributed with rate $a$. Show
that $X + Y$ is a Gamma random variable. With what parameters?

22. Let $X$ and $Y$ be two independent exponential random variables with parameters
$a$ and $b$, respectively. Assume that $a \ne b$. Find the density of $X + Y$.


23. Let $X$ and $Y$ be independent Gamma random variables with parameters $(r, \lambda)$
and $(s, \lambda)$, respectively. Show that $X + Y$ is also a Gamma random variable.
With what parameters?

24. Assume that $X$ and $Y$ are independent Gamma random variables with parameters
$(r, \lambda)$ and $(s, \lambda)$, respectively. Let $U = X/Y$. Show that $E(U) < +\infty$ if
and only if $s > 1$.

25. In this problem we show that $\Gamma(1/2) = \sqrt{\pi}$.


(a) Explain why

$$\int_0^{+\infty} \exp(-x^2/2)\,dx = \frac{\sqrt{2\pi}}{2}.$$

(b) Make the change of variable $u = x^2$ in the integral in (a) to get

$$\int_0^{+\infty} u^{-1/2} \exp(-u/2)\,du = \sqrt{2\pi}.$$

(c) Use (b) to show that $\Gamma(1/2) = \sqrt{\pi}$.
(d) Let $k$ be a natural number. Find $\Gamma(k/2)$.


Chapter 17
Covariance and Independence


1 Covariance 


The covariance is a measure of dependence between two random variables. Next we 
give the definition. 


• Let $X$ and $Y$ be two random variables such that $E(X^2)$ and $E(Y^2)$ are finite. The
covariance of $X$ and $Y$ is defined by

$$\mathrm{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))].$$


The definition above is valid for discrete and continuous random variables. In 
this chapter we will only deal with continuous random variables. The covariance 
computations will be based on the linearity of the expectation that we now recall. 


¢ Let X and Y be random variables and a be a real number. Assume that E| X| and 
E|Y| are finite, then 


E(aX) =aE(X), 
E(X+Y)= E(X)+ E(Y). 


These formulas will be proved in a later section of this chapter. 
We now state a computational formula for the covariance. 


• A computational formula for the covariance is

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y).$$




We prove this computational formula. Note first that 
(X — E(X))(Y — E(Y)) = XY — XE(Y) — E(X)Y+ E(X)E(Y). 


By taking the expectation on both sides and using that the expectation is linear we 
get 


$$\mathrm{Cov}(X, Y) = E[XY - XE(Y) - E(X)Y + E(X)E(Y)] = E[XY] - E[XE(Y)] - E[E(X)Y] + E[E(X)E(Y)].$$

Using that $E(X)$ and $E(Y)$ are constants and linearity of the expectation,

$$E[XE(Y)] = E(Y)E[X] \quad\text{and}\quad E[E(X)Y] = E(X)E[Y].$$

Hence,

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) - E(X)E(Y) + E(X)E(Y) = E(XY) - E(X)E(Y).$$


The formula is proved. 


Example 1 Let $(X, Y)$ be uniformly distributed on the triangle

$$T = \{(x, y) : 0 < y < x < 2\}.$$

What is the covariance of $(X, Y)$?

It is easy to see that the density of $(X, Y)$ is $f(x, y) = 1/2$ for $(x, y)$ in $T$ and
0 elsewhere. To get the covariance we need to compute $E(X)$, $E(Y)$, and $E(XY)$.
There are two methods to compute $E(X)$ and $E(Y)$. Either we compute the marginal
densities of $X$ and $Y$ and then $E(X)$ and $E(Y)$ or we can just use the joint density to
do these computations. The second method is usually faster and that is what we do
here. We will use the following formula:

$$E(g(X, Y)) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} g(x, y) f(x, y)\,dx\,dy.$$

If $E|g(X, Y)|$ is finite the formula is valid and the order of integration can be
changed. Using $g(x, y) = x$, we get

$$E(X) = \int_0^2 \int_0^x x\,\frac{1}{2}\,dy\,dx = \frac{4}{3}.$$

Similarly,

$$E(Y) = \int_0^2 \int_0^x y\,\frac{1}{2}\,dy\,dx = \frac{2}{3}.$$

Finally,

$$E(XY) = \int_0^2 \int_0^x xy\,\frac{1}{2}\,dy\,dx = 1.$$

Therefore, the covariance of $X$ and $Y$ is

$$\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = 1 - \frac{4}{3} \times \frac{2}{3} = \frac{1}{9}.$$
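
The covariance computed for the uniform distribution on the triangle can be checked by simulation with the computational formula $\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y)$. The sketch below is an illustration only; the rejection sampler, the seed and the sample size are our own choices.

```python
import random

def sample_triangle(rng):
    """Uniform point on {0 < y < x < 2}, by rejection from the square (0, 2)^2."""
    while True:
        x, y = rng.uniform(0.0, 2.0), rng.uniform(0.0, 2.0)
        if y < x:
            return x, y

rng = random.Random(17)  # fixed seed so the run is reproducible
n = 100_000
points = [sample_triangle(rng) for _ in range(n)]

mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
mean_xy = sum(x * y for x, y in points) / n

# Computational formula: Cov(X, Y) = E(XY) - E(X) E(Y).
cov = mean_xy - mean_x * mean_y

print(f"E(X) ~ {mean_x:.3f}, E(Y) ~ {mean_y:.3f}, E(XY) ~ {mean_xy:.3f}")
print(f"Cov(X, Y) ~ {cov:.4f}, exact value = {1 / 9:.4f}")
```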


2 Independence 


The following is an important property: 


• Let $X$ and $Y$ be two independent random variables. For any functions $h$ and $g$,

$$E(g(X)h(Y)) = E(g(X))\,E(h(Y)),$$

provided all the expectations exist.


This property will be proved in the problems.
A particular case of the property above is $g(x) = x$ and $h(y) = y$ for all $x$ and
$y$. We get,


• Let $X$ and $Y$ be two independent random variables. Then,

$$E(XY) = E(X)E(Y),$$

and therefore $\mathrm{Cov}(X, Y) = 0$.


This property tells us that since $\mathrm{Cov}(X, Y) \ne 0$ in Example 1, $X$ and $Y$ are
not independent. The converse of this property is not true. That is, we may have
$\mathrm{Cov}(X, Y) = 0$ and $X$ and $Y$ not independent. See the example below.


3 Correlation

We now introduce the notion of correlation.


• Assume that $E(X^2)$ and $E(Y^2)$ exist. The correlation of $X$ and $Y$ is defined by

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{SD(X)\,SD(Y)}.$$

For any random variables $X$ and $Y$ the correlation between $X$ and $Y$ is always
in $[-1, 1]$. The correlation between $X$ and $Y$ is $-1$ or 1 if and only if there are
constants $a$ and $b$ such that $Y = aX + b$.


The properties above will be proved at the end of this chapter. 


Example 2 Consider the random vector $(X, Y)$, which is uniform on the disc

$$C = \{(x, y) : x^2 + y^2 \le 1\}.$$

We have shown in a previous chapter that $X$ and $Y$ are not independent. However,
we will now show that they are uncorrelated. That is, their correlation is 0.
Recall that the joint density of $(X, Y)$ is $f(x, y) = 1/\pi$ for $(x, y)$ in $C$ and 0
elsewhere.

$$E(XY) = \int_{-1}^{1} \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} xy\,\frac{1}{\pi}\,dy\,dx.$$

Note that when we integrate in $y$ we get

$$\int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} y\,dy = 0.$$

Therefore $E(XY) = 0$. Similarly,

$$E(X) = \int_{-1}^{1} \int_{-\sqrt{1 - y^2}}^{\sqrt{1 - y^2}} x\,\frac{1}{\pi}\,dx\,dy = 0.$$

By symmetry we also have that $E(Y) = 0$. Therefore,

$$\mathrm{Cov}(X, Y) = \mathrm{Corr}(X, Y) = 0$$

although $X$ and $Y$ are not independent. This is so because correlation measures
the strength of the LINEAR dependence between $X$ and $Y$. Here there is no linear
dependence, but the random variables are dependent in some other way.
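
The following sketch illustrates numerically that the coordinates of a uniform point on the disc are uncorrelated but not independent. It estimates the covariance, and it compares $P(|X| > 0.8 \text{ and } |Y| > 0.8)$, which is 0 because no point of the disc satisfies both conditions, with the product $P(|X| > 0.8)P(|Y| > 0.8)$, which is positive. The threshold 0.8, the seed and the sample size are arbitrary choices made for illustration.

```python
import random

def sample_unit_disc(rng):
    """Rejection sampling: draw from the square [-1, 1]^2 until the point is in the disc."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y

rng = random.Random(18)  # fixed seed so the run is reproducible
n = 100_000
points = [sample_unit_disc(rng) for _ in range(n)]

mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
cov = sum(x * y for x, y in points) / n - mean_x * mean_y

# Independence would require P(A and B) = P(A) P(B) for these two events.
p_x = sum(1 for x, _ in points if abs(x) > 0.8) / n
p_y = sum(1 for _, y in points if abs(y) > 0.8) / n
p_both = sum(1 for x, y in points if abs(x) > 0.8 and abs(y) > 0.8) / n

print(f"Cov(X, Y) ~ {cov:.4f}")
print(f"P(both) = {p_both:.4f}, P(|X| > 0.8) P(|Y| > 0.8) ~ {p_x * p_y:.4f}")
```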


4 Variance of a Sum

We now turn to the variance of a sum of two random variables.

• Let $X$ and $Y$ be two random variables. Then,

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y).$$

In particular, if $X$ and $Y$ are independent then

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).$$

We now prove this formula. By definition of the variance,

$$\mathrm{Var}(X + Y) = E[(X + Y - E(X + Y))^2].$$

Note that,

$$(X + Y - E(X + Y))^2 = (X - E(X) + Y - E(Y))^2 = (X - E(X))^2 + (Y - E(Y))^2 + 2(X - E(X))(Y - E(Y)).$$

By taking expectations on both sides of the equality above,

$$E[(X + Y - E(X + Y))^2] = E[(X - E(X))^2] + E[(Y - E(Y))^2] + 2E[(X - E(X))(Y - E(Y))].$$

Hence,

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y).$$


Problems 


1. Let X, Y, and U be random variables and a and b be real numbers. Show the 
following properties of the covariance: 


(a) Cov(X, X) = Var(X). 

(b) Cov(X, Y) = Cov(Y, X). 

(c) Cov(aX, Y) = aCov(X, Y).

(d) Cov(U, aX + bY) =aCov(U, X) + bCov(U, Y). 
(e) Cov(X,a) = 0. 


2. Let $X$, $Y$ be random variables. Show that

$$\mathrm{Cov}(X - Y, X + Y) = \mathrm{Var}(X) - \mathrm{Var}(Y).$$


3. Let $f(x, y) = e^{-y}$ for $0 < x < y$ be the joint probability density of the vector
$(X, Y)$.


(a) Show that $f$ is indeed a probability density.
(b) Show that $E(X) = 1$, $E(Y) = 2$, and $E(XY) = 3$.
(c) What is the covariance of $(X, Y)$?
(d) Are $X$ and $Y$ independent?
(e) What is the correlation of $X$ and $Y$?


4. Let $f(x, y) = 8xy$ for $0 < y < x < 1$ be the joint probability density of the
vector $(X, Y)$.


(a) Show that $E(X) = 4/5$, $E(Y) = 8/15$, and $E(XY) = 4/9$.
(b) What is the covariance of $(X, Y)$?
(c) Compute the correlation of $(X, Y)$.


5. Let $Z$ be a standard normal variable.


(a) Show that $Z$ and $Z^2$ are uncorrelated.
(b) Are $Z$ and $Z^2$ independent?


6. Let $X$ and $Y$ be random variables and $a$ and $b$ be real numbers.


(a) Show that

$$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y).$$

(b) Assuming $X$ and $Y$ are independent, $\mathrm{Var}(X) = 1$, $\mathrm{Var}(Y) = 4$, what is
$\mathrm{Var}(X - Y)$?


7. Let $Y = aX + b$ where $X$ is a random variable and $a \ne 0$ and $b$ are real
numbers. Show that $\mathrm{Corr}(X, Y) = 1$ or $-1$ depending on the sign of $a$.


8. Let $Y_1$ and $Y_2$ be two independent random variables. The expectation of $Y_1$ and
$Y_2$ is 0 and their variance is 1. Let $a > 0$, $b > 0$, and $-1 < c < 1$ be real
numbers. Define

$$X_1 = aY_1 \quad\text{and}\quad X_2 = bcY_1 + b\sqrt{1 - c^2}\,Y_2.$$

(a) Show that $\mathrm{Var}(X_1) = a^2$ and $\mathrm{Var}(X_2) = b^2$.
(b) Show that $\mathrm{Cov}(X_1, X_2) = abc$.
(c) What is $\mathrm{Corr}(X_1, X_2)$?

9. In this problem we want to find the best linear predictor of $Y$ based on $X$.
Let $a$ and $b$ be real numbers. "Best" is with respect to the mean square error
$E[(Y - aX - b)^2]$. That is, we look for $a$ and $b$ that minimize the function
$f(a, b) = E[(Y - aX - b)^2]$.


(a) Show that

$$f(a, b) = E(Y^2) - 2aE(XY) - 2bE(Y) + a^2 E(X^2) + 2abE(X) + b^2.$$

(b) Show that for fixed $a$ the minimum of $f(a, b)$ is attained at
$b(a) = E(Y) - aE(X)$.
(c) Show that the minimum of $f(a, b(a))$ is attained at

$$\hat{a} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}.$$

(d) Show that the minimum of $f(a, b)$ is attained at $(\hat{a}, b(\hat{a}))$.
We denote $b(\hat{a})$ by $\hat{b}$ and we let $\hat{Y} = \hat{a}X + \hat{b}$. We have shown that $\hat{Y}$ is the
best linear predictor.
(e) Show that $Y - \hat{Y}$ and $\hat{Y}$ are uncorrelated.


10. Let $X$ and $Y$ be two independent random variables. For any functions $h$ and $g$,
show that

$$E(g(X)h(Y)) = E(g(X))\,E(h(Y)),$$


provided all the expectations exist. 


5 Proof That the Expectation Is Linear 


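Let $f_X$ be the probability density of $X$. By the linearity of the Riemann integral,

$$E(aX) = \int_{-\infty}^{+\infty} a x f_X(x)\,dx = a \int_{-\infty}^{+\infty} x f_X(x)\,dx = aE(X).$$

We now turn to the second property. Let $f(x, y)$ be the joint density of $(X, Y)$.
By the linearity of the integral,

$$E(X + Y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} (x + y) f(x, y)\,dx\,dy = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x f(x, y)\,dx\,dy + \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y f(x, y)\,dx\,dy.$$

Assuming $\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} |x| f(x, y)\,dx\,dy$ is finite we may interchange the order of
integration to get

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x f(x, y)\,dx\,dy = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} x f(x, y)\,dy\,dx = \int_{-\infty}^{+\infty} x f_X(x)\,dx = E(X).$$

Similarly, assuming $\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} |y| f(x, y)\,dx\,dy$ is finite,

$$\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} y f(x, y)\,dx\,dy = E(Y).$$

Hence,

$$E(X + Y) = E(X) + E(Y).$$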


6 Proof That the Correlation Is Bounded 


Correlations are standardized covariances. Covariances can be any number in 
(—oo, +00) while correlations are always in [—1,1]. So it is easier to interpret a 
correlation than a covariance. We now prove that correlations are always in [—1, 1]. 
Let X and Y be two random variables such that E(X2) and E(Y2) exist. Assume 
that SD(X) and SD(Y) are strictly positive. Let 


— xX ri Y 
~ §D(X) SD(Y)’ 


6 Proof That the Correlation Is Bounded 195 


Using the formula for the variance of a sum, 


»¢ Y 
Var(U) =Var( ay’ + Var) 


x y 
SD(X)’ SD(y) 


) 


+ 2Cov( 


Recall that Var is quadratic. That is, for any constant a, Var(aX) = a’Var(X). 
In particular, 


: Rs. ff 
ar (D(X) > SD(XY? 


Var(x)=1 


and similarly Var(> By) = 1. Using the bilinearity of covariance, 


Y 1 


Cov)’ SDO) > SD(X)SDW) 


Cov(X, Y) = Corr(X, Y). 


Hence, 
Var) =1+4+1+4+2Corr(X, Y) = 2(1+ Corr(X, Y)). 


Since Var(U) => 0 (a variance is always positive) this yields 1 + Corr(X, Y) > 0. 
That is, a correlation is always larger than or equal to —1. 

Note also that Corr(X, Y) = −1 only if Var(U) = 0. The variance can be 0
only if the random variable is a constant. That is,

\[
\frac{X}{SD(X)} + \frac{Y}{SD(Y)} = c
\]

for some constant c. That is, there is a linear relationship between X and Y when
Corr(X, Y) = −1.
To prove the inequality Corr(X, Y) ≤ 1 let

\[
V = \frac{X}{SD(X)} - \frac{Y}{SD(Y)}.
\]

Doing computations similar to the ones we just did we get
Var(V) = 2(1 − Corr(X, Y)).
Since Var(V) ≥ 0, Corr(X, Y) ≤ 1. Moreover, Corr(X, Y) = 1 only if V is a
constant. That is, Corr(X, Y) = 1 only if there is a linear relationship between X
and Y.


Chapter 18
Conditional Distribution and Expectation


1 The Discrete Case 


The conditional distribution is defined as follows: 


• Let X and Y be two discrete random variables. The conditional distribution of
X given Y = y is defined by

\[
P(X = x \mid Y = y) = \frac{P(X = x; Y = y)}{P(Y = y)}
\]

for all x in the support of X.


In the preceding definition we need to assume that P(Y = y) > 0.
A word on notation: we use “A, B” and “A; B” interchangeably. Both designate the
intersection of the two events A and B.


Example 1 The table below gives the joint density of (X, Y).


y\x | 0   | 1   | 2
1   | 1/8 | 1/8 | 1/4
2   | 0   | 1/8 | 1/8
3   | 1/8 | 1/8 | 0


The support of X is {0, 1, 2}. The support of Y is {1, 2, 3}. We read in the table 
above that P(X = 1; Y = 2) = 1/8. The marginal distributions of X and Y are 
easily obtained. For instance, 


P(X =1) = P(X =1;Y =1)+ P(X =1;Y =2)+ P(X =1;Y =3). 


© Springer Nature Switzerland AG 2022 197 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_18 




That is, we sum the column below X = 1 to get P(X = 1). In a similar way we
obtain P(X = 0) and P(X = 2). The marginal distribution of X is


x        | 0   | 1   | 2
P(X = x) | 1/4 | 3/8 | 3/8


By summing rows we get the marginal distribution of Y.


y        | 1   | 2   | 3
P(Y = y) | 1/2 | 1/4 | 1/4


Recall that X and Y are independent if and only if 


P(X = x; Y = y) = P(X = x)P(Y = y),


for all x and y. In the present example it is easy to see that X and Y are not
independent. For instance, P(X = 0; Y = 2) = 0 but P(X = 0)P(Y = 2) ≠ 0.

We now turn to the computation of conditional distributions. By definition of 
conditional probabilities, 


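\[
P(X = 0 \mid Y = 1) = \frac{P(X = 0; Y = 1)}{P(Y = 1)} = \frac{1/8}{1/2} = \frac{1}{4}.
\]

Similarly, we get

\[
P(X = 1 \mid Y = 1) = \frac{1/8}{1/2} = \frac{1}{4}
\]

and

\[
P(X = 2 \mid Y = 1) = \frac{1/4}{1/2} = \frac{1}{2}.
\]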


The conditional distribution of X given Y = 1 is


x                | 0   | 1   | 2
P(X = x | Y = 1) | 1/4 | 1/4 | 1/2
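
The computations of Example 1 can be reproduced mechanically. The following is a minimal sketch, assuming the joint table is stored as a dictionary keyed by (x, y); the function names `marginal_x`, `marginal_y` and `conditional_x_given_y` are introduced here for illustration only. Exact fractions are used so that the results can be compared with the tables above.

```python
from fractions import Fraction as Fr

# Joint distribution of (X, Y) from Example 1, keyed as (x, y).
joint = {
    (0, 1): Fr(1, 8), (1, 1): Fr(1, 8), (2, 1): Fr(1, 4),
    (0, 2): Fr(0),    (1, 2): Fr(1, 8), (2, 2): Fr(1, 8),
    (0, 3): Fr(1, 8), (1, 3): Fr(1, 8), (2, 3): Fr(0),
}

def marginal_x(joint):
    """P(X = x): sum the joint probabilities over y (column sums)."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, Fr(0)) + p
    return out

def marginal_y(joint):
    """P(Y = y): sum the joint probabilities over x (row sums)."""
    out = {}
    for (_x, y), p in joint.items():
        out[y] = out.get(y, Fr(0)) + p
    return out

def conditional_x_given_y(joint, y):
    """P(X = x | Y = y) = P(X = x; Y = y) / P(Y = y), defined when P(Y = y) > 0."""
    p_y = marginal_y(joint)[y]
    if p_y == 0:
        raise ValueError("conditioning on an event of probability 0")
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

cond_given_1 = conditional_x_given_y(joint, 1)
```

The same function applied with y = 2 or y = 3 gives the other conditional distributions of X.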


As the next example illustrates we sometimes use the conditional distribution of a 
random variable to compute the (unconditioned) distribution of the random variable. 




Example 2 Assume that the number of customers going into a bank between 2:00
and 3:00 PM has a Poisson distribution with rate λ. Assume that each customer has
a probability p of being female. What is the distribution of the number of female
customers between 2:00 and 3:00 PM?

Let N be the total number of customers and F be the number of female
customers. We first compute the distribution of F. For any fixed nonnegative integer
f,

\[
P(F = f) = \sum_{n \ge f} P(N = n; F = f),
\]

where we use the fact that N ≥ F. By definition of the conditional probability

P(N = n; F = f) = P(F = f|N = n)P(N = n).


Given that N = n, F is the number of females among n customers. Each arriving
customer has a probability p of being female. Assume that arrivals are female or
male independently of each other. Then, given N = n the number of females F has
a binomial distribution with parameters n and p. That is,

\[
P(F = f \mid N = n) = \binom{n}{f} p^f (1 - p)^{n-f}.
\]

Since

\[
P(N = n) = \exp(-\lambda)\frac{\lambda^n}{n!},
\]

\[
P(N = n; F = f) = P(F = f \mid N = n)P(N = n)
                = \binom{n}{f} p^f (1 - p)^{n-f} \exp(-\lambda)\frac{\lambda^n}{n!}.
\]

A little algebra shows that

\[
\binom{n}{f} p^f (1 - p)^{n-f} \frac{\lambda^n}{n!}
  = \frac{(\lambda p)^f}{f!} \cdot \frac{(\lambda(1 - p))^{n-f}}{(n - f)!}.
\]

Hence,

\[
P(F = f) = \sum_{n \ge f} P(N = n; F = f)
         = \frac{(\lambda p)^f}{f!} \exp(-\lambda) \sum_{n \ge f} \frac{(\lambda(1 - p))^{n-f}}{(n - f)!}.
\]

Finally, note that

\[
\sum_{n \ge f} \frac{(\lambda(1 - p))^{n-f}}{(n - f)!}
  = \sum_{k \ge 0} \frac{(\lambda(1 - p))^{k}}{k!}
  = \exp((1 - p)\lambda).
\]

Thus,

\[
P(F = f) = \frac{(\lambda p)^f}{f!} \exp(-\lambda)\exp((1 - p)\lambda)
         = \frac{(\lambda p)^f}{f!} \exp(-\lambda p).
\]

That is, F is also Poisson distributed and its rate is λp.
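
The conclusion of Example 2 can be checked numerically. The sketch below sums P(F = f | N = n)P(N = n) over n and compares the result with a Poisson probability. The rate 4.0 and the probability 0.3 are arbitrary illustrative values, not taken from the text, and the infinite sum is truncated at a hypothetical cutoff `n_max` beyond which the Poisson terms are negligible.

```python
import math

def poisson_pmf(k, rate):
    """P(N = k) for a Poisson random variable with the given rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def binomial_pmf(f, n, p):
    """P(F = f) for a binomial random variable with parameters n and p."""
    return math.comb(n, f) * p**f * (1 - p) ** (n - f)

def female_pmf_by_conditioning(f, rate, p, n_max=100):
    """Sum of P(F = f | N = n) P(N = n) over n >= f, truncated at n_max."""
    return sum(binomial_pmf(f, n, p) * poisson_pmf(n, rate)
               for n in range(f, n_max + 1))

rate, p = 4.0, 0.3          # illustrative values (assumption)
lhs = [female_pmf_by_conditioning(f, rate, p) for f in range(10)]
rhs = [poisson_pmf(f, rate * p) for f in range(10)]   # Poisson with rate 4.0 * 0.3
```

The two lists agree up to floating point error, as the derivation predicts.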


Problems 


1. For the distribution in Example 1, 


(a) Compute the conditional distribution of X given Y = 2. 
(b) Compute the conditional distribution of Y given X = 0. 


2. In Example 2 let M be the number of male customers. Show that F and M are 
independent. 

3. Assume that X and Y are discrete independent random variables. Show that for 
all k and j 


P(X = j, X + Y = k) = P(X = j)P(Y = k − j).


4. Assume that X and Y are independent binomial random variables with
parameters (n, p) and (ℓ, p), respectively.


(a) Show that for 0 ≤ j ≤ k

\[
P(X = j \mid X + Y = k) = \frac{\binom{n}{j}\binom{\ell}{k-j}}{\binom{n+\ell}{k}}.
\]

(Use that X + Y has a binomial distribution with parameters (n + ℓ, p).)



(b) Use (a) to prove that

\[
\sum_{j=0}^{k} \binom{n}{j}\binom{\ell}{k-j} = \binom{n+\ell}{k}.
\]


5. Let X and Y be two discrete random variables. Show that for y in the support
of Y

\[
\sum_{x \in S} P(X = x \mid Y = y) = 1,
\]

where S is the support of X.
6. Assume that X and Y are independent and Poisson distributed with rates λ and
μ, respectively. Fix a positive integer n.


(a) Show that for any positive integer k ≤ n

\[
P(X = k \mid X + Y = n) = \binom{n}{k} \left(\frac{\lambda}{\lambda + \mu}\right)^{k} \left(\frac{\mu}{\lambda + \mu}\right)^{n-k}.
\]


(b) What is the name and the parameters of the conditional distribution in (a)? 


7. Let N be a number picked uniformly in {1, 2, 3}. That is, P(N = k) = 1/3 for
k = 1, 2, 3. Given N = k we toss a fair coin k times. Let B be the number
of heads among the k tosses. 


(a) What is the conditional distribution of B given N = k? With what 
parameters? 
(b) Compute the (unconditional) distribution of B. 


2 Continuous Case 


We start with a definition. 


• Assume that X and Y are continuous random variables with joint density f. Let
f_Y be the marginal density of the random variable Y. Let y be in the support of
Y (i.e., f_Y(y) > 0). The conditional density of X given Y = y is defined to be

\[
f(x \mid y) = \frac{f(x, y)}{f_Y(y)}.
\]

We will also use the notation f(x|Y = y) for f(x|y) when we want to
emphasize that y is a value of the random variable Y.




Example 3 Take (X, Y) uniformly distributed on the two dimensional unit disc.
That is,

\[
f(x, y) = \frac{1}{\pi} \quad \text{for } x^2 + y^2 \le 1.
\]

Given Y = 1/2 what is the conditional density of X?
Recall that f_Y is computed by the formula

\[
f_Y(y) = \int_{-\sqrt{1 - y^2}}^{\sqrt{1 - y^2}} f(x, y)\,dx.
\]

Hence,

\[
f_Y(y) = \frac{2}{\pi}\sqrt{1 - y^2} \quad \text{for } y \in (-1, 1).
\]

By definition of the conditional density,

\[
f(x \mid Y = 1/2) = \frac{f(x, 1/2)}{f_Y(1/2)}
                  = \frac{1/\pi}{(2/\pi)\sqrt{1 - (1/2)^2}} \quad \text{for } x^2 + (1/2)^2 \le 1.
\]

Thus,

\[
f(x \mid Y = 1/2) = \frac{1}{\sqrt{3}} \quad \text{for } -\frac{\sqrt{3}}{2} \le x \le \frac{\sqrt{3}}{2}.
\]

That is, X conditioned on Y = 1/2 is uniformly distributed on [−√3/2, √3/2]. There is
nothing special about Y = 1/2, of course. One can show that for any y in (−1, 1)
the conditional density of X given Y = y is uniform on [−√(1 − y²), √(1 − y²)].


Example 4 Assume that X is picked uniformly in (0, 1) and then Y is picked
uniformly in (0, X). What is the distribution of the vector (X, Y)?

From the information above we get that the conditional distribution of Y given
X = x is uniform in (0, x). That is,

\[
f(y \mid x) = \frac{1}{x} \quad \text{for } y \in (0, x).
\]

We use the formula for conditional density to get

\[
f(x, y) = f(y \mid x) f_X(x) = \frac{1}{x} \quad \text{for } 0 < y < x < 1,
\]

where we are using that f_X(x) = 1 for x in (0, 1) and 0 otherwise. Note that the
vector (X, Y) is not uniformly distributed.
Is Y uniformly distributed? The marginal density of Y is

\[
f_Y(y) = \int_{y}^{1} f(x, y)\,dx = \int_{y}^{1} \frac{1}{x}\,dx = -\ln y \quad \text{for } y \in (0, 1).
\]


Hence, Y is not uniformly distributed either. 

The conditional density may be used to compute conditional probabilities. For 
instance, what is P(Y < 1/4|X = 1/2)? 

Since f(y|X = 1/2) = 2 for y ∈ (0, 1/2),

\[
P(Y < 1/4 \mid X = 1/2) = \int_{0}^{1/4} 2\,dy = 1/2.
\]


Problems 


8. Consider Example 4. Compute the conditional density of X given Y = y for a
fixed y in (0, 1).
9. The joint density of (X, Y) is given by 


f(x, y) = exp(−y) for 0 < x < y.


(a) Check that f is indeed a density.
(b) Fix y > 0. Show that the conditional density of X given Y = y is uniform.
(c) Compute the conditional density of Y given X = x for a fixed x > 0.


10. Assume that X and Y are continuous random variables. Show that for each y in
the support of Y,

\[
\int_{-\infty}^{+\infty} f(x \mid y)\,dx = 1.
\]


11. Let (X, Y) be uniformly distributed on the triangle 0 < x < y <1. 


(a) Find the conditional density of Y given X = x. 
(b) Compute P(Y > 1/2|X = 1/4). 


3 Conditional Expectation 


The definition below applies equally well to discrete and continuous random 
variables. 


¢ Let X and Y be two random variables. We denote by E(X|Y = y) the expected 
value of the conditional distribution of X given Y = y. The conditional 
expectation of X given Y is denoted by E(X|Y) and is defined to be E(X|Y = 
y) on the event Y = y. 


Note that E(X|Y = y) assigns a value to every y in the support of Y. Hence,
E(X|Y = y) is a function g(y) and therefore E(X|Y) = g(Y). Since the conditional
expectation E(X|Y) is a function of the random variable Y, E(X|Y) is also a 
random variable. 


Example 5 In Example 2 the conditional distribution of F given N = n is a 
binomial distribution with parameters n and p. Hence, 


E(F|N =n) =np. 


That is, E(F|N = n) = g(n) where g(n) = np for every positive integer n.
Therefore,


E(F|N) = g(N) = pN.


Example 6 Consider Example 4. The random variable X is uniformly distributed on 
(0, 1). Given X = x then Y is uniformly distributed on (0, x). Since the expected 
value of a uniform distribution on (0, x) is x /2 we have 

E(Y|X =x) =x/2. 
Hence, 


E(Y|X) = X/2. 


We now list several properties of the conditional expectation. We omit the proofs;
the interested reader may look at Durrett (1994), for instance.


P1 For any random variables X and Y, 
E[E(X|Y)] = E(X). 
P2 Assume that X and Y are two independent random variables. Then, 


E(X|Y) = E(X). 




P3 The conditional expectation is linear. That is, let U, V, and W be random 
variables and a and b be real numbers then 


E(aU + bV|W) = aE(U|W) + bE(V|W).
P4 Let h be a function and X and Y be random variables. Then
E(h(Y)X|Y) = h(Y)E(X|Y).
P5 Let h be a function and Y a random variable. Then
E(h(Y)|Y) = h(Y).


In the next two examples we show how property P1 can be used to compute an 
unconditional expectation. 


Example 7 We go back to Example 4. Recall that E(Y|X) = X/2. By P1,
E[E(Y|X)] = E(Y).


Hence, E(Y) = E(X/2). Using that X is uniformly distributed on (0, 1), E(X) = 
1/2. Therefore, 


E(Y) = E(X/2) = 1/4, 


We now check property P1 by computing E(Y) directly. We have already
computed


f_Y(y) = −ln y for y ∈ (0, 1).


Doing an integration by parts,

\[
E(Y) = \int_{0}^{1} -y \ln y\,dy
     = \left[ -\frac{y^2}{2} \ln y \right]_{0}^{1} + \int_{0}^{1} \frac{y}{2}\,dy
     = \frac{1}{4}.
\]

Hence, we do have E[E(Y|X)] = E(Y).
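
The two computations of E(Y) in Example 7 can also be compared numerically. The sketch below approximates the integrals with a simple midpoint rule; the helper `midpoint_integral` and the number of steps are choices made here for illustration, not part of the text.

```python
import math

def midpoint_integral(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over (a, b)."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def f_Y(y):
    """Marginal density of Y from Example 4: -ln(y) on (0, 1)."""
    return -math.log(y)

# The density should integrate to 1 over (0, 1).
total_mass = midpoint_integral(f_Y, 0.0, 1.0)
# E(Y) computed directly from the marginal density.
mean_direct = midpoint_integral(lambda y: y * f_Y(y), 0.0, 1.0)
# E(Y) computed through P1: E(Y) = E(X/2) with E(X) = 1/2.
mean_via_p1 = 0.5 * 0.5
```

Both routes give a value close to 1/4. The midpoint rule is used because it never evaluates the logarithm at the endpoint 0.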


Example 8 Let N and F be two discrete random variables. Assume that the
conditional distribution of F given N = n is binomial with parameters n and p.
Hence, E(F|N = n) = np and E(F|N) = pN. By P1,


E(F) = E[E(F|N)] = E(pN) = pE(N). 


3.1 Conditional Expectation and Prediction 


• Let X and Y be two random variables such that X has a finite second moment.
The minimum of


E[(X − h(Y))²]
over all functions h such that h(Y) has a finite second moment is attained for
h(Y) = E(X|Y).


In words, the best way to predict X using Y is the conditional expectation
E(X|Y). Note that our definition of “best” is with respect to the mean quadratic
distance E[(X − h(Y))²] (this is why we need X and h(Y) to have a second moment).
We now prove this result. 

Define g as


g(Y) = E(X|Y).
We have

\[
E[(X - h(Y))^2] = E[(X - g(Y) + g(Y) - h(Y))^2]
                = E[(X - g(Y))^2] + E[(g(Y) - h(Y))^2] + 2E[(X - g(Y))(g(Y) - h(Y))].
\]

We will show that the double product is 0. Once this is done, since
E[(g(Y) − h(Y))²] is never negative,


E[(X − h(Y))²] ≥ E[(X − g(Y))²],
for all h. This shows that the minimum value of E[(X − h(Y))²] is attained for
h = g. We now show that the double product is 0. By P4
E[(h(Y) − g(Y))X|Y] = (h(Y) − g(Y))E(X|Y)
= (h(Y) − g(Y))g(Y).


By taking expectations on both sides and using P1,


E[(h(Y) − g(Y))X] = E[(h(Y) − g(Y))g(Y)].


We move the right hand side to the left and use the linearity of expectations to get
E[(h(Y) − g(Y))(X − g(Y))] = 0.


This proves that the double product above is 0 and we are done. 
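
The result can be illustrated on the discrete joint distribution of Example 1. The sketch below computes g(y) = E(X|Y = y) exactly and compares the mean quadratic distance of g(Y) with that of a few other predictors h(Y). The competing predictors are arbitrary choices made here for illustration.

```python
from fractions import Fraction as Fr

# Joint distribution of (X, Y) from Example 1, keyed as (x, y).
joint = {
    (0, 1): Fr(1, 8), (1, 1): Fr(1, 8), (2, 1): Fr(1, 4),
    (0, 2): Fr(0),    (1, 2): Fr(1, 8), (2, 2): Fr(1, 8),
    (0, 3): Fr(1, 8), (1, 3): Fr(1, 8), (2, 3): Fr(0),
}

def cond_expectation_x(joint, y):
    """E(X | Y = y): mean of the conditional distribution of X given Y = y."""
    p_y = sum(p for (_x, yy), p in joint.items() if yy == y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

def mse(joint, h):
    """Mean quadratic distance E[(X - h(Y))^2] for a predictor h of X."""
    return sum(p * (x - h(y)) ** 2 for (x, y), p in joint.items())

g = {y: cond_expectation_x(joint, y) for y in (1, 2, 3)}
best = mse(joint, lambda y: g[y])

# A few competing predictors h(Y), chosen arbitrarily for comparison.
competitors = [
    lambda y: Fr(9, 8),        # the constant E(X)
    lambda y: Fr(1),           # another constant
    lambda y: Fr(y, 2),        # a linear function of y
    lambda y: Fr(3) - y,       # a decreasing function of y
]
others = [mse(joint, h) for h in competitors]
```

Every entry of `others` is at least as large as `best`, in line with the statement that h(Y) = E(X|Y) minimizes E[(X − h(Y))²].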


Example 9 Consider Example 4 again. The random variable X is uniformly 
distributed on (0, 1). Given X = x then Y is uniformly distributed on (0, x). What 
is the best predictor of Y of the form h(X)? 
We know that the best predictor of Y based on X is E(Y |X). Since 
E(Y|X) = X/2 


the best predictor of Y is X/2. 


Problems 


12. Consider X and Y with joint density
f(x, y) = x + y for 0 < x < 1 and 0 < y < 1.


(a) What is E(X|Y)?
(b) What is E(X)?


13. Consider X and Y with joint density
f(x, y) = 8xy for 0 < y < x < 1.


(a) What is E(X|Y)?
(b) What is E(XY²|Y)?


14. Consider Example 4. Show that the best predictor of X based on Y is

\[
-\frac{1}{\ln Y}(1 - Y).
\]


15. Assume that X and Y are discrete random variables.
(a) Show that
E(X|Y) = g(Y),

where

\[
g(y) = \frac{1}{P(Y = y)} \sum_{x} x\,P(X = x; Y = y).
\]

(b) Prove property P1 in the discrete case.
(c) Prove property P2 in the discrete case.


16. Consider a sequence X1, ..., Xn, ... of i.i.d. discrete random variables with
expected value μ. Let N be a random variable taking values in the natural numbers
and independent of X1, ..., Xn, .... Define Y as

\[
Y = X_1 + X_2 + \cdots + X_N,
\]

so Y is a sum of a random number of random variables.


(a) Show that
P(Y = k|N = n) = P(X1 + ··· + Xn = k).
(b) Use (a) to show that
E(Y|N = n) = nμ.
(c) Compute E(Y).


17. Let X and Y be two random variables such that E(X|Y) is a constant c.


(a) Show that c = E(X).
(b) Show that
E(XY) = E(X)E(Y).
(c) Show that X and Y are uncorrelated.


18. Let X and Y be independent and having the same distribution.


(a) Show that
E(X|X + Y) = E(Y|X + Y).
(b) Explain why
E(X + Y|X + Y) = X + Y.
(c) Use (a) and (b) to show that
E(X|X + Y) = (1/2)(X + Y).


Chapter 19
The Bivariate Normal Distribution


1 The Correlation 


We now introduce a bivariate normal vector. 


• Let Z1 and Z2 be two independent standard normal random variables. Let σ1 >
0, σ2 > 0, −1 < ρ < 1, μ1, and μ2 be real numbers. Define

\[
X_1 = \sigma_1 Z_1 + \mu_1 \quad \text{and} \quad X_2 = \sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2.
\]

Then, (X1, X2) is said to have a bivariate normal distribution with parameters
μ1, μ2, σ1, σ2, and ρ.

• Assume (X1, X2) is a bivariate normal vector with parameters μ1, μ2, σ1, σ2,
and ρ. Then X1 is normally distributed with mean μ1 and variance σ1² and X2 is
normally distributed with mean μ2 and variance σ2².


It is easy to see that X1 is normally distributed as a linear transformation of a
normal random variable Z1. The fact that X2 is normally distributed comes from
the following property of the normal distribution. A linear combination of two
independent normal variables is also normal. This will be proved in the chapter on
moment generating functions. Since X2 is a linear combination of the two independent normal
random variables Z1 and Z2, X2 is also normally distributed.

We will show the following properties of a bivariate normal: 


• Consider a bivariate random vector (X1, X2) as defined above. Then, E(X1) =
μ1, E(X2) = μ2, Var(X1) = σ1², Var(X2) = σ2², and the correlation of
(X1, X2) is ρ.


We now perform these computations. We will use that Z1 and Z2 are independent,
E(Z1) = E(Z2) = 0, and Var(Z1) = Var(Z2) = 1. Note that these
computations hold for any distributions for Z1 and Z2. The fact that Z1 and Z2
are normally distributed plays no role at this point.


© Springer Nature Switzerland AG 2022 209
R. B. Schinazi, Probability with Statistical Applications,
https://doi.org/10.1007/978-3-030-93635-8_19

Using that E(Z1) = E(Z2) = 0 and the linearity of the expectation it is easy to
see that E(X1) = μ1 and E(X2) = μ2. This is left to the reader.

Using that the variance is shift invariant and is quadratic,

\[
Var(X_1) = Var(\sigma_1 Z_1 + \mu_1) = Var(\sigma_1 Z_1) = \sigma_1^2\, Var(Z_1) = \sigma_1^2.
\]


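We now turn to Var(X2). Using the independence of Z1 and Z2,

\[
Var(X_2) = Var\left(\sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2\right)
         = Var(\sigma_2 \rho Z_1) + Var\left(\sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2\right)
\]
\[
         = \sigma_2^2 \rho^2\, Var(Z_1) + \sigma_2^2 (1 - \rho^2)\, Var(Z_2)
         = \sigma_2^2 \rho^2 + \sigma_2^2 (1 - \rho^2)
         = \sigma_2^2.
\]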


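We now compute the covariance of X1 and X2. Recall that the covariance is shift
invariant and linear in both components, hence

\[
Cov(X_1, X_2) = Cov\left(\sigma_1 Z_1 + \mu_1,\ \sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2\right)
              = Cov\left(\sigma_1 Z_1,\ \sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2\right)
\]
\[
              = \sigma_1 \sigma_2 \rho\, Cov(Z_1, Z_1) + \sigma_1 \sigma_2 \sqrt{1 - \rho^2}\, Cov(Z_1, Z_2).
\]

Using now that Cov(Z1, Z1) = Var(Z1) = 1 and that Cov(Z1, Z2) = 0 we get
Cov(X1, X2) = σ1σ2ρ.
By definition

\[
Corr(X_1, X_2) = \frac{Cov(X_1, X_2)}{\sigma_1 \sigma_2} = \rho.
\]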
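
These moment computations only use the coefficients of Z1 and Z2 in X1 and X2. The sketch below encodes that observation; the function name `bivariate_moments` is introduced here for illustration, and the numerical parameters are sample values.

```python
import math

def bivariate_moments(sigma1, sigma2, rho):
    """Variances, covariance and correlation of (X1, X2) where
    X1 = sigma1*Z1 + mu1 and X2 = sigma2*rho*Z1 + sigma2*sqrt(1 - rho^2)*Z2 + mu2.
    Only Var(Z1) = Var(Z2) = 1 and Cov(Z1, Z2) = 0 are used; the shifts
    mu1 and mu2 drop out of every variance and covariance."""
    a = (sigma1, 0.0)                                        # coefficients of (Z1, Z2) in X1
    b = (sigma2 * rho, sigma2 * math.sqrt(1.0 - rho**2))    # coefficients of (Z1, Z2) in X2
    var1 = a[0] ** 2 + a[1] ** 2
    var2 = b[0] ** 2 + b[1] ** 2
    cov = a[0] * b[0] + a[1] * b[1]
    corr = cov / math.sqrt(var1 * var2)
    return var1, var2, cov, corr

var1, var2, cov, corr = bivariate_moments(5.0, 5.0, 0.6)
```

For these sample values the function returns variances 25 and 25, covariance 15 and correlation 0.6, up to floating point error.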




An interesting consequence of the computations above is the following. 


• Let (X1, X2) be a bivariate normal vector. Then X1 and X2 are independent if and
only if their correlation is 0.


We already know that if X1 and X2 are independent, then their correlation ρ = 0.
This is true for any two random variables. Conversely assume that ρ = 0. Going
back to the definition of (X1, X2) we get for ρ = 0,


X1 = σ1Z1 + μ1 and X2 = σ2Z2 + μ2.


Note that X1 is a function of Z1 only and X2 is a function of Z2 only. Since Z1 and
Z2 are independent so are X1 and X2.


2 An Application 


Francis Galton in the 1800s looked at heights of fathers and sons. He noticed that the 
distribution of heights for fathers was normal and so was the distribution of heights 
for sons. He also noticed that the height of a son is not independent of the height of 
his father. 

Let (X1, X2) be a bivariate normal vector. Let X1 be the height of a father and
X2 be the height of his son. Let μ1 = 175 cm, μ2 = 178 cm, σ1 = σ2 = 5 cm, and
ρ = 0.6.

What is the probability that a son is taller than his father? That is, what is P(X2 >
X1)?

Going back to the definition of (X1, X2),

\[
P(X_2 > X_1) = P\left(\sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2 > \sigma_1 Z_1 + \mu_1\right)
             = P\left((\sigma_2 \rho - \sigma_1) Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 > \mu_1 - \mu_2\right).
\]

Let

\[
Y = (\sigma_2 \rho - \sigma_1) Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2.
\]


Since Y is a linear combination of the two independent normal random variables Z1
and Z2 we see that Y is also normally distributed. Since E(Z1) = E(Z2) = 0, we
get E(Y) = 0. Using that Z1 and Z2 are independent with variance 1,

\[
Var(Y) = (\sigma_2 \rho - \sigma_1)^2 + \sigma_2^2 (1 - \rho^2).
\]

In this application we are assuming σ1 = σ2 = 5 and ρ = 0.6, hence the variance
of Y is (3 − 5)² + 25(1 − 0.36) = 4 + 16 = 20.


Since Y/SD(Y) is a standard normal Z,

\[
P(X_2 > X_1) = P(Y > \mu_1 - \mu_2) = P\left(Z > \frac{175 - 178}{\sqrt{20}}\right).
\]

Using the normal table we find that P(X2 > X1) ≈ 0.75. That is, there is about a 75%
chance that the son is taller than the father.
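
As a numerical cross-check of this computation, the sketch below evaluates the variance formula for Y and the resulting normal probability with the stated parameters. The helper `std_normal_cdf` is a name introduced here; it is built on `math.erf` and stands in for the normal table.

```python
import math

def std_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu1, mu2 = 175.0, 178.0
sigma1, sigma2, rho = 5.0, 5.0, 0.6

# Var(Y) = (sigma2*rho - sigma1)^2 + sigma2^2 * (1 - rho^2) = 4 + 16
var_y = (sigma2 * rho - sigma1) ** 2 + sigma2**2 * (1.0 - rho**2)

# P(X2 > X1) = P(Y > mu1 - mu2) = P(Z > (mu1 - mu2) / SD(Y))
z_threshold = (mu1 - mu2) / math.sqrt(var_y)
prob_son_taller = 1.0 - std_normal_cdf(z_threshold)
```

A useful sanity check is that Var(Y) must equal Var(X2 − X1) = σ1² + σ2² − 2ρσ1σ2, which gives 25 + 25 − 30 = 20 for these parameters.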


2.1 Best Predictor 


Given the height of the father what is the best predictor for the height of the son? 

We have seen that given X1, the best predictor for X2 is the conditional
expectation E(X2|X1). We now compute this conditional expectation for a normal
bivariate vector. Using the definition of (X1, X2),

\[
E(X_2 \mid X_1) = E\left(\sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2 \mid X_1\right)
               = \sigma_2 \rho\, E(Z_1 \mid X_1) + \sigma_2 \sqrt{1 - \rho^2}\, E(Z_2 \mid X_1) + \mu_2.
\]

Since X1 = σ1Z1 + μ1, knowing Z1 is equivalent to knowing X1. Hence,
E(Z1|X1) = Z1. Recall that Z1 and Z2 are independent. Thus, X1 and Z2 are
independent as well. Hence,

E(Z2|X1) = E(Z2) = 0.
Therefore,

E(X2|X1) = σ2ρZ1 + μ2.


Using that X1 = σ1Z1 + μ1,

\[
E(X_2 \mid X_1) = \frac{\sigma_2}{\sigma_1} \rho X_1 + \mu_2 - \frac{\sigma_2}{\sigma_1} \rho \mu_1.
\]

That is, the best predictor of X2 based on X1 is a linear function of X1.
Going back to our application with μ1 = 175 cm, μ2 = 178 cm, σ1 = σ2 = 5
cm, and ρ = 0.6 we get


E(X2|X1) = 0.6X1 + 73.




Using the conditional expectation we can answer the following question. When 
is the son predicted to be taller than his father? 
The son is predicted to be taller than the father if and only if 


0.6X1 + 73 > X1.


That is, if and only if X1 < 182.5. If the father is taller than 182.5 cm, then the son
is predicted to be shorter than the father. On the other hand if the father is shorter 
than 182.5 cm, then the son is predicted to be taller than the father. This is what 
Galton called regression toward the mean. 
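
The predictor and the crossover height can be written as two short functions. This is a sketch under the parameters of the application; the function names `predict_x2` and `crossover_height` are introduced here for illustration.

```python
import math

def predict_x2(x1, mu1, mu2, sigma1, sigma2, rho):
    """Best predictor E(X2 | X1 = x1) = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)."""
    return mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)

def crossover_height(mu1, mu2, sigma1, sigma2, rho):
    """Value x with E(X2 | X1 = x) = x, i.e. slope * x + intercept = x.
    Requires slope = rho * sigma2 / sigma1 to differ from 1."""
    slope = rho * sigma2 / sigma1
    intercept = mu2 - slope * mu1
    return intercept / (1.0 - slope)

params = dict(mu1=175.0, mu2=178.0, sigma1=5.0, sigma2=5.0, rho=0.6)
prediction_180 = predict_x2(180.0, **params)   # 0.6 * 180 + 73
threshold = crossover_height(**params)         # 73 / 0.4
```

Fathers below the threshold have sons predicted taller than themselves, and fathers above it have sons predicted shorter, which is the regression effect described above.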


Problems 
1. Let

\[
X_1 = \sigma_1 Z_1 + \mu_1 \quad \text{and} \quad X_2 = \sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2.
\]

Show that E(X1) = μ1 and E(X2) = μ2.

2. Let (X1, X2) be a bivariate normal vector. Let X1 be the height of a father and
X2 be the height of his son. Let μ1 = 175 cm, μ2 = 178 cm, σ1 = σ2 = 5 cm,
and ρ = 0.6.


(a) What is the probability that a son is taller than 180 cm? 

(b) What is the probability that a father is shorter than 170 cm? 

(c) What is the probability that the father is at least 2 cm taller than the son? 
(d) What is the probability that the son is at least 5 cm taller than the father? 


3. Use the hypotheses of Problem 2 to compute the following: 


(a) Assume that the father is 170 cm tall. What is the predicted height for the 
son? 

(b) Assume that the father is 190 cm tall. What is the predicted height for the 
son? 


4. Let X1 and X2 be the scores of a student in the first and second test in
a probability class. Assume that (X1, X2) is a bivariate normal vector. Let
μ1 = 75, μ2 = 65, σ1 = 10, σ2 = 15, and ρ = 0.5.


(a) What is the probability that a student gets a 60 or lower on the first test?
(b) What is the probability that a student gets a 70 or higher on the second test?
(c) Given that a student got a 70 on the first test what is the predicted score on
the second test?
(d) What is the probability that a student gets a higher score on the second test
than on the first test?


3 The Joint Probability Density 


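We compute the joint probability density of (X1, X2). Let

\[
x_1 = \sigma_1 z_1 + \mu_1,
\]
\[
x_2 = \sigma_2 \rho z_1 + \sigma_2 \sqrt{1 - \rho^2}\, z_2 + \mu_2.
\]

We solve the system of equations above in (z1, z2),

\[
z_1 = \frac{1}{\sigma_1}(x_1 - \mu_1),
\]
\[
z_2 = -\frac{\rho}{\sigma_1 \sqrt{1 - \rho^2}}(x_1 - \mu_1) + \frac{1}{\sigma_2 \sqrt{1 - \rho^2}}(x_2 - \mu_2).
\]

Hence, the transformation (z1, z2) → (x1, x2) is one to one from R² to R².
The Jacobian determinant ∂(z1, z2)/∂(x1, x2) of the inverse transformation is

\[
\frac{1}{\sigma_1 \sigma_2 \sqrt{1 - \rho^2}}.
\]

Since Z1 and Z2 are independent standard normal random variables the joint
density of (Z1, Z2) is for all (z1, z2) ∈ R²

\[
\varphi(z_1, z_2) = \frac{1}{2\pi} \exp\left(-\frac{1}{2}(z_1^2 + z_2^2)\right).
\]

A little algebra shows

\[
z_1^2 + z_2^2 = \frac{1}{1 - \rho^2} \left( \frac{1}{\sigma_1^2}(x_1 - \mu_1)^2
  - \frac{2\rho}{\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2)
  + \frac{1}{\sigma_2^2}(x_2 - \mu_2)^2 \right).
\]

Hence, the joint density of (X1, X2) is for all (x1, x2) ∈ R²

\[
f(x_1, x_2) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2(1 - \rho^2)} B(x_1, x_2)\right),
\]

where

\[
B(x_1, x_2) = \frac{1}{\sigma_1^2}(x_1 - \mu_1)^2 - \frac{2\rho}{\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2) + \frac{1}{\sigma_2^2}(x_2 - \mu_2)^2.
\]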
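
The "little algebra" step can be checked numerically by evaluating the density in two ways: from the closed form, and from the standard normal density of (Z1, Z2) multiplied by the Jacobian factor. The sketch below does both; the function names and the evaluation points are choices made here for illustration.

```python
import math

def joint_density(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Closed-form bivariate normal density with quadratic form B(x1, x2)."""
    a = (x1 - mu1) / sigma1
    b = (x2 - mu2) / sigma2
    B = a * a - 2.0 * rho * a * b + b * b
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho**2)
    return math.exp(-B / (2.0 * (1.0 - rho**2))) / norm

def density_by_change_of_variables(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Density of (Z1, Z2) at the solved (z1, z2), times the Jacobian factor."""
    root = math.sqrt(1.0 - rho**2)
    z1 = (x1 - mu1) / sigma1
    z2 = -rho * (x1 - mu1) / (sigma1 * root) + (x2 - mu2) / (sigma2 * root)
    phi = math.exp(-0.5 * (z1 * z1 + z2 * z2)) / (2.0 * math.pi)
    jacobian = 1.0 / (sigma1 * sigma2 * root)
    return phi * jacobian

params = (175.0, 178.0, 5.0, 5.0, 0.6)      # mu1, mu2, sigma1, sigma2, rho
points = [(175.0, 178.0), (180.0, 185.0), (168.0, 181.5), (190.0, 170.0)]
pairs = [(joint_density(x1, x2, *params),
          density_by_change_of_variables(x1, x2, *params)) for x1, x2 in points]
```

The two values in each pair agree up to floating point error.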


4 The Conditional Probability Density 


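Let (X1, X2) be a bivariate normal vector. We will compute the conditional
probability density of X2 given X1. Using the definition of the conditional density,

\[
f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}.
\]

Since X1 = σ1Z1 + μ1 and Z1 is a standard normal it is easy to see that X1 is
normally distributed with mean μ1 and standard deviation σ1. Using the density of
X1 and the density of (X1, X2) (computed in the previous section),

\[
f(x_2 \mid x_1) = \frac{1}{\sigma_2 \sqrt{2\pi} \sqrt{1 - \rho^2}} \exp(A),
\]

where

\[
A = -\frac{1}{2(1 - \rho^2)} \left( \frac{1}{\sigma_1^2}(x_1 - \mu_1)^2
    - \frac{2\rho}{\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2)
    + \frac{1}{\sigma_2^2}(x_2 - \mu_2)^2 \right)
    + \frac{1}{2\sigma_1^2}(x_1 - \mu_1)^2.
\]

Note that

\[
-\frac{1}{2(1 - \rho^2)\sigma_1^2}(x_1 - \mu_1)^2 + \frac{1}{2\sigma_1^2}(x_1 - \mu_1)^2
  = -\frac{\rho^2}{2(1 - \rho^2)\sigma_1^2}(x_1 - \mu_1)^2.
\]

Using this last expression in A,

\[
A = -\frac{1}{2\sigma_2^2(1 - \rho^2)} \left( \rho^2 \frac{\sigma_2^2}{\sigma_1^2}(x_1 - \mu_1)^2
    - 2\rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1)(x_2 - \mu_2) + (x_2 - \mu_2)^2 \right)
  = -\frac{1}{2\sigma_2^2(1 - \rho^2)} \left( x_2 - \mu_2 - \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1) \right)^2.
\]

Hence,

\[
f(x_2 \mid x_1) = \frac{1}{\sqrt{2\pi}\, \sigma_2 \sqrt{1 - \rho^2}}
  \exp\left( -\frac{1}{2\sigma_2^2(1 - \rho^2)} \left( x_2 - \mu_2 - \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1) \right)^2 \right).
\]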


This yields the following result. 




• The conditional distribution of X2 given X1 = x1 is normally distributed with
mean

\[
\mu_2 + \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1)
\]

and variance

\[
\sigma_2^2 (1 - \rho^2).
\]

Note that this is consistent with the computation of the conditional expectation
done in Sect. 2,

\[
E(X_2 \mid X_1) = \frac{\sigma_2}{\sigma_1} \rho X_1 + \mu_2 - \frac{\sigma_2}{\sigma_1} \rho \mu_1.
\]


Going back to our application we can answer questions such as the following. 
Given that the father is 180 cm tall what is the probability that the son will be at 
least 185 cm tall? 

We use μ1 = 175, μ2 = 178, σ1 = σ2 = 5, and ρ = 0.6. Given that x1 = 180,

\[
\mu_2 + \rho \frac{\sigma_2}{\sigma_1}(x_1 - \mu_1) = 181
\]

and σ2²(1 − ρ²) = 16. Hence, the conditional distribution of X2 given X1 = 180 is
normally distributed with mean 181 and variance 16. By the normal table,

\[
P(X_2 > 185 \mid X_1 = 180) = P\left(Z > \frac{185 - 181}{\sqrt{16}}\right) = P(Z > 1) = 0.16.
\]
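
The conditional mean, the conditional variance and the final probability can be recomputed directly. In the sketch below, `conditional_params` and `std_normal_cdf` are helper names introduced here; the latter uses `math.erf` in place of the normal table.

```python
import math

def std_normal_cdf(z):
    """P(Z <= z) for a standard normal Z, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_params(x1, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of the conditional distribution of X2 given X1 = x1."""
    mean = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    var = sigma2**2 * (1.0 - rho**2)
    return mean, var

cond_mean, cond_var = conditional_params(180.0, 175.0, 178.0, 5.0, 5.0, 0.6)

# P(X2 > 185 | X1 = 180) = P(Z > (185 - mean) / sqrt(var))
prob = 1.0 - std_normal_cdf((185.0 - cond_mean) / math.sqrt(cond_var))
```

With mean 181 and variance 16 the threshold is one conditional standard deviation above the mean, so the probability is P(Z > 1), about 0.16.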


Problems 


5. Let (X1, X2) be a bivariate normal vector. Let X1 be the height of a father and
X2 be the height of his son. Let μ1 = 175 cm, μ2 = 178 cm, σ1 = σ2 = 5 cm,
and ρ = 0.6. Given that the father is 190 cm tall what is the probability that the
son is shorter than 185 cm?

6. Let X1 and X2 be the scores of a student in the first and second test in
a probability class. Assume that (X1, X2) is a bivariate normal vector. Let
μ1 = 75, μ2 = 65, σ1 = 10, σ2 = 15, and ρ = 0.5.

Given that a student got a 70 on the first test what is the probability that she gets
a 75 or higher in the second test?

7. Let Z1 and Z2 be two independent standard normal random variables. Consider
(Z1, Z2) in polar coordinates. That is, Z1 = R cos Θ and Z2 = R sin Θ where
R > 0 and Θ is in (0, 2π).


(a) Find the joint density of (R, Θ).
(b) Show that R and Θ are independent.
(c) Show that Θ is uniformly distributed. On what support?


8. Consider a bivariate normal vector (X1, X2). That is,

\[
X_1 = \sigma_1 Z_1 + \mu_1 \quad \text{and} \quad X_2 = \sigma_2 \rho Z_1 + \sigma_2 \sqrt{1 - \rho^2}\, Z_2 + \mu_2,
\]

where Z1 and Z2 are independent standard normal random variables.


(a) Find the conditional density f(x1|x2). (Use that X2 is normally distributed.)

(b) Use (a) to find E(X1|X2).

(c) The predictor of X2 given X1 is a linear relation between X2 and X1. So is
the predictor of X1 given X2. Are the two linear relations equivalent?


9. Let (X1, X2) be a bivariate normal vector with joint density f(x1, x2). Let ρ =
Corr(X1, X2) = 0.
(a) Show that there are functions g and h such that for all x1 and x2,
f(x1, x2) = g(x1)h(x2).


(b) Use (a) to show that X1 and X2 are independent when ρ = 0.


10. Let (X1, X2) be a bivariate normal vector with joint density f(x1, x2). Integrate
the joint density to find the marginal density of X2.


Chapter 20 
Sums of Bernoulli Random Variables


In many cases the distribution of a random variable is too involved to be computed. 
In some of those cases it is possible to break up the random variable into a sum 
of Bernoulli random variables. This is what we did to compute the mean of a 
binomial random variable. In the binomial case the Bernoulli random variables are 
independent and identically distributed. In the next examples the Bernoulli random 
variables will be either dependent or not identically distributed. 

Recall that X is said to be a Bernoulli random variable if X takes only the values 0
and 1. The distribution of X is given by P(X = 0) = 1 − p and P(X = 1) = p,
where p is a parameter in (0, 1).


1 The Expected Number of Birthdays 


Let B be the number of distinct birthdays in a class of 50 students. What is the 
expectation E(B)? 

The distribution of B is clearly fairly involved. We will compute the expected
value without computing the distribution of B. To do so we write B as a sum of
Bernoulli random variables. Set X1 = 1 if at least one student was born on January
1, otherwise set X1 = 0. Set X2 = 1 if at least one student was born on January 2,
otherwise set X2 = 0. We define Xi as above for every one of the 365 days of the
calendar. We claim that


B = X1 + X2 + ··· + X365.


This is so because the r.h.s. counts all the days on which at least one student has a 
birthday. 


© Springer Nature Switzerland AG 2022 219 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_20 




Note that the Xi are Bernoulli random variables. In order for X1 = 0 we must
have that none of the 50 students was born on January 1. Thus,

\[
P(X_1 = 0) = \left(\frac{364}{365}\right)^{50}.
\]

Hence,

\[
p = P(X_1 = 1) = 1 - \left(\frac{364}{365}\right)^{50}.
\]

All the Xi have the same p (which is also the expected value of a Bernoulli random
variable). By the addition rule,

\[
E(B) = E(X_1) + E(X_2) + \cdots + E(X_{365}) = 365p = 365\left(1 - \left(\frac{364}{365}\right)^{50}\right).
\]

Numerically, we get
E(B) = 46.79.
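
The numerical value follows from a one-line computation. The sketch below wraps it in a function so that other class sizes can be tried; the function name `expected_distinct_birthdays` is introduced here, and the formula assumes, as in the text, that each student is equally likely to be born on any of the 365 days, independently of the others.

```python
def expected_distinct_birthdays(students, days=365):
    """E(B) = days * p, where p = 1 - ((days - 1) / days) ** students is the
    probability that at least one student was born on a given day."""
    p = 1.0 - ((days - 1) / days) ** students
    return days * p

expected_50 = expected_distinct_birthdays(50)
```

For a class of 50 the function returns a value that rounds to 46.79.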


Observe that the maximum value for B is 50. We have seen in Chap. 1 that the
probability of having two students share the same birthday is about 0.96. This is
also the probability that B ≤ 49. On the other hand E(B) is close to 50. This shows
that one common birthday is likely but multiple common birthdays are unlikely.
Next we give another application.


Example 1 Assume that 3 people enter independently in an elevator that goes to 5
floors. What is the expected number of stops S that the elevator is going to make?
Instead of computing the distribution of S we break S into a sum of 5 Bernoulli
random variables as follows. Let X1 = 1 if at least one person goes to floor 1,
otherwise we set X1 = 0. Likewise let X2 = 1 if at least one person goes to floor 2,
otherwise we set X2 = 0. We do the same for the 5 possible choices. We have


S = X1 + X2 + ··· + X5.
Note that X1 = 0 if none of the 3 people picks floor 1. Thus,
P(X1 = 0) = (4/5)³.


Hence, p = P(X1 = 1) = 1 − (4/5)³. All the Xi have the same Bernoulli
distribution. By the addition rule,


E(S) = 5p = 5(1 − (4/5)³) = 61/25 = 2.44.




We now compute E(S) by using the distribution of S. The random variable S
may only take values 1, 2, and 3. In order to have S = 1, the second and third
person need to pick the same floor as the first person. Thus,


P(S = 1) = (1/5)².


To have S = 2, exactly two people pick the same floor and one picks a different
floor. There are three different possibilities, all yielding the same probability. Hence,


P(S = 2) = 3(1/5)(4/5).
Finally, S = 3 happens only if the three persons pick distinct floors:
P(S = 3) = (4/5)(3/5).
Thus,

\[
E(S) = 1 \times \frac{1}{25} + 2 \times \frac{12}{25} + 3 \times \frac{12}{25} = \frac{61}{25}.
\]

Even in this very simple case (S has only 3 values after all) it is better to compute 
the expected value of S by breaking S in a sum of Bernoulli random variables rather 
than compute the distribution of S. 
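
Both computations can be confirmed by listing all 5³ equally likely choices of floors. The sketch below does this with exact fractions; it assumes, as the example implicitly does, that each person picks one of the 5 floors uniformly at random. The function name `stop_distribution` is introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

def stop_distribution(people=3, floors=5):
    """Exact distribution of the number of stops S, obtained by listing
    every equally likely assignment of a floor to each person."""
    counts = {}
    for choice in product(range(floors), repeat=people):
        stops = len(set(choice))            # number of distinct floors chosen
        counts[stops] = counts.get(stops, 0) + 1
    total = floors**people
    return {s: Fraction(c, total) for s, c in counts.items()}

dist = stop_distribution()
expected_stops = sum(s * p for s, p in dist.items())
bernoulli_formula = 5 * (1 - Fraction(4, 5) ** 3)    # 5p from the sum of Bernoullis
```

The enumeration and the Bernoulli formula give the same expected value, 61/25.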


2 The Matching Problem 


We have n boxes and n balls, both are numbered from 1 to n. We put every ball in a
box at random. Every box has exactly one ball. We say that we have a match if the 
ball number coincides with the box number. The number of matches is therefore a 
random variable that can take values 0, 1,...,n. 


2.1 Expected Number of Matches 


What is the expected number of matches? 
Let X1 = 1 if ball number 1 is in box number 1. Let X1 = 0 otherwise. Note that
there are n! ways to put n distinct balls in n distinct boxes. Hence,

\[
P(X_1 = 1) = \frac{(n - 1)!}{n!} = \frac{1}{n}.
\]

This is so because when ball 1 is in box 1 there are (n − 1)! ways to put the other
n − 1 balls in n − 1 boxes. More generally, for 1 ≤ i ≤ n we set Xi = 1 if ball
number i is in box number i and Xi = 0 otherwise. Using the same argument as
above, P(Xi = 1) = 1/n. Let Mn be the number of matches. Then,

Mn = X1 + ··· + Xn.
Since E(Xi) = 1/n for every i,

E(Mn) = nE(X1) = 1.
This is a remarkable result. The expected number of matches is 1 irrespective of
what n is.


Next we would like to compute the variance of the number of matches. To do so 
we need a formula for the variance of a sum. 


2.2. Variance of a Sum 


• Let X1, X2, ..., Xn be a sequence of n random variables. Then,

\[
Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i) + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Cov(X_i, X_j).
\]

Note that for n = 2 we get
Var(X1 + X2) = Var(X1) + Var(X2) + 2Cov(X1, X2).
For n = 3,


Var(X1 + X2 + X3) = Var(X1) + Var(X2) + Var(X3) +
2 (Cov(X1, X2) + Cov(X1, X3) + Cov(X2, X3)).


The proof of the formula is based on the following algebra identity that can be
proved by induction on n. Let a1, a2, ..., an be n real numbers. Then,

\[
\left(\sum_{i=1}^{n} a_i\right)^2 = \sum_{i=1}^{n} a_i^2 + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} a_i a_j.
\]




Using this formula for $a_i = X_i - E(X_i)$,

\[ \Big(\sum_{i=1}^{n} (X_i - E(X_i))\Big)^2 = \sum_{i=1}^{n} (X_i - E(X_i))^2 + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} (X_i - E(X_i))(X_j - E(X_j)). \]

Taking expectations on both sides of the equality yields

\[ \mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \mathrm{Cov}(X_i, X_j). \]


This proves the formula. 
A special case of this formula will be particularly useful. 


• Let $X_1, X_2, \ldots, X_n$ be a sequence of n random variables. Assume that
$\mathrm{Var}(X_1) = \mathrm{Var}(X_2) = \cdots = \mathrm{Var}(X_n)$ and that $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_1, X_2)$
for all $i \ne j$. Then,

\[ \mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) = n\,\mathrm{Var}(X_1) + n(n-1)\,\mathrm{Cov}(X_1, X_2). \tag{1} \]

This formula is easy to see. Since all the variances are equal the sum of the
n variances is just $n\,\mathrm{Var}(X_1)$. There are $\binom{n}{2}$ ways of picking a pair $i \ne j$ and each
$\mathrm{Cov}(X_i, X_j)$ appears twice in the expansion. Hence, the sum of covariances is

\[ 2\binom{n}{2}\mathrm{Cov}(X_1, X_2) = n(n-1)\,\mathrm{Cov}(X_1, X_2). \]


2.3. Variance of the Number of Matches 


Recall that for $1 \le i \le n$, $X_i = 1$ if ball number $i$ is in box number $i$ and $X_i = 0$
otherwise. We showed that $P(X_i = 1) = 1/n$. Hence, $E(X_i^2) = 1/n$ and for all
$1 \le i \le n$,

\[ \mathrm{Var}(X_i) = E(X_i^2) - E(X_i)^2 = \frac{1}{n} - \frac{1}{n^2}. \]




We now consider the distribution of $X_1 X_2$. Clearly $X_1 X_2$ can only take values 1
and 0. In order to have $X_1 X_2 = 1$ we need ball 1 in box 1 and ball 2 in box 2. There
are $(n-2)!$ ways to place the other $n-2$ balls in $n-2$ boxes. Hence,

\[ P(X_1 X_2 = 1) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}. \]

Therefore, $E(X_1 X_2) = \frac{1}{n(n-1)}$ and

\[ \mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2) = \frac{1}{n(n-1)} - \frac{1}{n^2} = \frac{1}{n^2(n-1)}. \]


The same computation applies to all covariances. Since all variances are equal and
all covariances are equal we can apply formula (1),

\[ \mathrm{Var}(M_n) = \mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) = n\,\mathrm{Var}(X_1) + n(n-1)\,\mathrm{Cov}(X_1, X_2) = n\Big(\frac{1}{n} - \frac{1}{n^2}\Big) + n(n-1)\frac{1}{n^2(n-1)} = 1. \]


The variance for the number of matches is 1 (as is the expected value). 
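
The two results can be checked exactly for small n by listing all n! arrangements. The sketch below is only an illustration; the function name match_moments is a hypothetical helper, and a permutation of range(n) stands in for an assignment of balls to boxes.

```python
from fractions import Fraction
from itertools import permutations


def match_moments(n):
    """Return the exact mean and variance of the number of matches for n balls."""
    total = 0        # sum of the number of matches over all arrangements
    total_sq = 0     # sum of the squared number of matches
    count = 0        # number of arrangements, equal to n!
    for perm in permutations(range(n)):
        # perm[i] is the box that receives ball i; a match means perm[i] == i
        m = sum(1 for i, box in enumerate(perm) if i == box)
        total += m
        total_sq += m * m
        count += 1
    mean = Fraction(total, count)
    variance = Fraction(total_sq, count) - mean ** 2
    return mean, variance


for n in range(2, 8):
    print(n, match_moments(n))
```

For every n between 2 and 7 both values come out equal to 1, in agreement with the computations above. Enumeration is only practical for small n since the number of arrangements grows like n!.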


3 The Moments of the Hypergeometric 


We have already seen this distribution in the Combinatorics chapter, where we
computed the mean and variance of a hypergeometric distribution directly. In
this section we will use Bernoulli random variables to get a quick derivation of these
quantities.

First we recall the distribution. Consider an urn with b blue marbles and r red
marbles. Assume that we take randomly $n \le b + r$ marbles from the urn without
replacement. Let H be the number of blue marbles in the sample. Then, H is said
to have a hypergeometric distribution with parameters b, r, and n.




For $1 \le i \le n$, let $X_i = 1$ if the i-th draw is a blue marble. Let $X_i = 0$ otherwise.
It is easy to see that the total number of blue marbles after n draws is

\[ H = X_1 + X_2 + \cdots + X_n. \]


By the symmetry of sampling without replacement it turns out that all $X_i$ for
$i = 1, 2, \ldots, n$ have the same distribution. Likewise all the vectors $(X_i, X_j)$ for
$i \ne j$ have the same distribution. The random variables

\[ (X_1, X_2, \ldots, X_n) \]

are said to be exchangeable. We will not prove these claims but we will check that
they are true on some particular examples in the problems.
Note that the $X_i$ are Bernoulli random variables and

\[ P(X_1 = 1) = \frac{b}{r+b}. \]

Since all the $X_i$ have the same distribution,

\[ E(H) = E(X_1) + \cdots + E(X_n) = nE(X_1) = \frac{nb}{r+b}. \]


We now turn to the variance of H.
Note that the product $X_1 X_2$ takes only values 0 and 1. The product takes value 1
if and only if $X_1 = 1$ and $X_2 = 1$. Hence,

\[ P(X_1 X_2 = 1) = P(X_2 = 1 \mid X_1 = 1)\,P(X_1 = 1). \]

Given that $X_1 = 1$,

\[ P(X_2 = 1 \mid X_1 = 1) = \frac{b-1}{r+b-1}. \]

Thus,

\[ P(X_1 X_2 = 1) = \frac{b(b-1)}{(r+b)(r+b-1)}. \]




Therefore,

\[ \mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2) = \frac{b(b-1)}{(r+b)(r+b-1)} - \Big(\frac{b}{r+b}\Big)^2 = -\frac{br}{(b+r)^2(b+r-1)}. \]


Note that we should expect $X_1$ and $X_2$ to be negatively correlated. If the first draw
is blue, then there is one less blue marble for the second draw.

Since all the vectors $(X_i, X_j)$ have the same distribution, all the covariances
are the same. Note also that

\[ \mathrm{Var}(X_1) = \frac{b}{r+b}\Big(1 - \frac{b}{r+b}\Big) = \frac{br}{(b+r)^2}. \]

Since all the variances $\mathrm{Var}(X_i)$ are equal we can apply formula (1) to get

\[ \mathrm{Var}(H) = n\,\mathrm{Var}(X_1) + n(n-1)\,\mathrm{Cov}(X_1, X_2) = n\frac{br}{(b+r)^2} - n(n-1)\frac{br}{(b+r)^2(b+r-1)} = \frac{nbr}{(b+r)^2} \times \frac{b+r-n}{b+r-1}. \]
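
As a sanity check, the mean and variance obtained from the Bernoulli decomposition can be compared with the values computed directly from the hypergeometric probabilities. This is a minimal sketch; the function names and the parameter values used below are illustrative choices, not taken from the text, and it assumes b + r is at least 2.

```python
from fractions import Fraction
from math import comb


def hypergeometric_moments(b, r, n):
    """Exact mean and variance of the number of blue marbles in n draws."""
    total = comb(b + r, n)
    mean = Fraction(0)
    second = Fraction(0)
    for k in range(0, min(b, n) + 1):
        # P(H = k): choose k of the b blue and n - k of the r red marbles
        prob = Fraction(comb(b, k) * comb(r, n - k), total)
        mean += k * prob
        second += k * k * prob
    return mean, second - mean ** 2


def formula_moments(b, r, n):
    """Mean and variance given by the Bernoulli-sum derivation."""
    N = b + r
    mean = Fraction(n * b, N)
    variance = Fraction(n * b * r, N ** 2) * Fraction(N - n, N - 1)
    return mean, variance


print(hypergeometric_moments(5, 7, 4))
print(formula_moments(5, 7, 4))
```

Both functions use exact rational arithmetic, so the two printed pairs can be compared for equality rather than up to rounding.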


4 The Number of Records 


Consider a sequence $W_1, W_2, \ldots$ of independent identically distributed random
variables. Moreover, assume that the common distribution of these random variables
is continuous. In particular, if $i \ne j$ then $P(W_i = W_j) = 0$.

A record is said to occur at time $n \ge 1$ if

\[ W_n = \max\{W_1, \ldots, W_n\}. \]

Note that according to the definition there is always a record at n = 1. Let $R_n$
be the number of records up to time n. Let $X_i = 1$ if there is a record at time i and
$X_i = 0$ otherwise. It is easy to see that

\[ R_n = X_1 + \cdots + X_n. \]

We will use this representation of $R_n$ as a sum of Bernoulli random variables to
compute its mean.




• For $n \ge 1$, the probability that there is a record at time n is

\[ P(X_n = 1) = \frac{1}{n}. \]

We now prove this claim. Let

\[ M_n = \max\{W_1, \ldots, W_n\}. \]

The maximum $M_n$ is equal to some $W_i$ for exactly one i in $\{1, \ldots, n\}$ (this is where
the assumption that the distribution is continuous is critical). Thus,

\[ \sum_{i=1}^{n} P(M_n = W_i) = 1. \]

Because the $W_i$ are independent and identically distributed the maximum $M_n$ is equally
likely to occur at any time $i \le n$. That is,

\[ P(M_n = W_1) = P(M_n = W_2) = \cdots = P(M_n = W_n). \]

Therefore,

\[ \sum_{i=1}^{n} P(M_n = W_i) = nP(M_n = W_n) = 1. \]

Hence,

\[ P(M_n = W_n) = \frac{1}{n}. \]

Note now that the event $\{M_n = W_n\}$ is exactly the event $\{X_n = 1\}$ (i.e., there is a
record at time n). This proves the claim.

• The expected number of records up to time n is

\[ E(R_n) = \sum_{i=1}^{n} \frac{1}{i}. \]

To prove the formula for $E(R_n)$ we just use that $R_n = X_1 + \cdots + X_n$ and that
$E(X_i) = 1/i$ for $i = 1, 2, \ldots, n$.

As n goes to infinity $\sum_{i=1}^{n} \frac{1}{i}$ can be approximated by $\ln(n)$. Hence, the number
of records grows very slowly with n. For instance, up to time $n = 10^5$ we expect
only about 11 records.
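
The formula for the expected number of records can be verified exactly for small n. The sketch below relies on one standard fact that is used here as an assumption: for independent continuous random variables all n! relative orderings of the values are equally likely, so it is enough to count records over all permutations of the ranks. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def expected_records(n):
    """Exact expected number of records among n i.i.d. continuous values."""
    total = 0
    count = 0
    for ranks in permutations(range(n)):
        best = -1
        for value in ranks:
            if value > best:      # a new maximum: a record occurs here
                best = value
                total += 1
        count += 1
    return Fraction(total, count)


def harmonic(n):
    """The sum 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, n + 1))


for n in range(1, 7):
    print(n, expected_records(n), harmonic(n))
```

The two columns agree for each n, which is what the representation of the number of records as a sum of Bernoulli random variables predicts.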


Problems 


1. Let B be the number of distinct birthdays in a class of 200 students. What is the 
E(B)? 

2. There are 8 people in a bus and 5 bus stops ahead. What is the expected number 
of stops the bus will have to make for these 8 people? 

3. (a) Roll a die twice. What is the expected number of distinct faces? 
(b) Roll a die n times. What is the expected number of distinct faces? 

4. Assume that 5 people throw their hat in a bag and then pick at random a hat in 
the bag. 


(a) What is the probability that each person gets her own hat? 
(b) What is the expected number of persons who get their own hat? 


5. Let $M_n$ be the number of matches for n balls and n boxes. We have shown that

\[ M_n = X_1 + \cdots + X_n, \]

where the $X_i$ are Bernoulli random variables with the same p = 1/n.

(a) Explain why $M_n$ is not a binomial random variable.

(b) Explain why as n goes to infinity it would make sense to approximate $M_n$
by a binomial with parameters (n, 1/n).

(c) Using (b) show that $M_n$ may be approximated using a Poisson random
variable. With what parameter?

(d) Use the Poisson approximation in (c) to find the probability of having 2 or
more matches.

(e) We simulated matching for n = 50. We ran the simulation 1000 times and
got the following results:

Number of matches | 0   | 1   | 2   | 3  | 4  | 5
Frequency         | 367 | 383 | 173 | 58 | 16 | 3

Are these results consistent with a Poisson distribution?
6. Consider an urn with b blue marbles and r red marbles. Assume that we take
randomly n marbles from the urn without replacement. For $1 \le i \le n$, let
$X_i = 1$ if the i-th draw is a blue marble. Let $X_i = 0$ otherwise.

(a) By conditioning on the value of $X_1$ compute $P(X_2 = 1)$.

(b) By conditioning on the values of $X_1$ and $X_2$ compute $P(X_3 = 1)$.

(c) By conditioning on the value of $X_1$ compute $P(X_2 X_3 = 1)$.

(d) By conditioning on the value of $X_2$ compute $P(X_1 X_3 = 1)$.




7. In a certain game a bet can be lost, won, or tied. I bet n times, the number 
of wins and losses are denoted by W and L, respectively. I win a bet with 
probability p and lose a bet with probability r so that p +r < 1. Assume that 
the bets are independent of each other. 


(a) Explain why W, L, and W + L have binomial distributions. Specify the 
parameters for each random variable. 
(b) Using that 


Var(W + L) = Var(W) + Var(L) + 2Cov(W, L), 


show that Cov(W, L) = —npr. 

(c) Explain why we could have expected a negative sign in (b). 

(d) Compute the correlation between W and L. What is remarkable about this 
result? 


Chapter 21
Coupling Random Variables


In this chapter we give several examples of coupling. This is a very important 
probability technique. It is useful in proving many results in a simple and elegant 
way. 


1 Coupling Two Bernoulli Random Variables 


Let $0 < p_1 < p_2 < 1$. Consider two Bernoulli random variables $X_1$ and $X_2$ with
parameters $p_1$ and $p_2$, respectively. Since $X_1 = 1$ with probability $p_1$ and $X_2 = 1$
with probability $p_2$, one might expect $X_2$ to be 1 when $X_1$ is 1. However, if we take
$X_1$ and $X_2$ to be independent, then $X_1 = 1$ and $X_2 = 0$ with probability $p_1(1 - p_2)$.

The technique described below shows that it is possible to construct $X_1$ and $X_2$
(they will not be independent!) in such a way that when $X_1$ is 1 so is $X_2$. In fact, we
will have $X_1 \le X_2$.

Let U be a uniform random variable on (0, 1). We define the random variables
$X_1$ and $X_2$ as follows.

If $U < p_1$, then set $X_1 = 1$. If $U > p_1$, then set $X_1 = 0$. Similarly for $X_2$, set
$X_2 = 1$ if $U < p_2$, and $X_2 = 0$ if $U > p_2$. Note that we use the same U to define
$X_1$ and $X_2$. The random variables $X_1$ and $X_2$ are said to be coupled.

This construction of the vector $(X_1, X_2)$ yields several properties. Since U is
uniform on (0, 1), $P(U < x) = x$ for all x in (0, 1). Therefore,

\[ P(X_1 = 1) = p_1 \quad \text{and} \quad P(X_1 = 0) = 1 - p_1. \]

That is, $X_1$ is a Bernoulli random variable with parameter $p_1$. Similarly, $X_2$
is a Bernoulli random variable with parameter $p_2$. Moreover, we get the joint
distribution of $(X_1, X_2)$.


© Springer Nature Switzerland AG 2022 231 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_21 




$X_2 \backslash X_1$ | 0           | 1
0                    | $1 - p_2$   | 0
1                    | $p_2 - p_1$ | $p_1$


The probabilities in the table are easy to check. For instance, $X_1 = 0$ and $X_2 = 0$
if and only if $U > p_2$. This happens with probability $1 - p_2$. Note also that $X_1 = 1$
and $X_2 = 0$ if $U < p_1$ and $U > p_2$. Since $p_1 < p_2$, this cannot happen and
has probability 0. Note that the event $(X_1, X_2) = (1, 0)$ is the same as the event
$X_1 > X_2$. Since this event has probability 0, we see that $X_1 \le X_2$ has probability
one. Thus, this coupling achieved our goal of constructing the vector $(X_1, X_2)$ in
such a way that $X_1 \le X_2$.
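
The construction translates directly into a short simulation. The following is a minimal sketch: the function name, the seed, and the values of p1 and p2 are illustrative assumptions, and Python's random.random() (uniform on [0, 1)) plays the role of U.

```python
import random


def coupled_bernoulli(p1, p2, rng):
    """Return (X1, X2) built from a single uniform U, as in the coupling."""
    u = rng.random()              # U is uniform on [0, 1)
    x1 = 1 if u < p1 else 0
    x2 = 1 if u < p2 else 0
    return x1, x2


rng = random.Random(0)            # fixed seed, chosen arbitrarily
p1, p2 = 0.3, 0.7                 # illustrative values with p1 < p2
samples = [coupled_bernoulli(p1, p2, rng) for _ in range(100_000)]

print(all(x1 <= x2 for x1, x2 in samples))            # X1 <= X2 on every draw
print(sum(x1 for x1, _ in samples) / len(samples))    # close to p1
print(sum(x2 for _, x2 in samples) / len(samples))    # close to p2
```

Using the same uniform draw for both variables is the whole point: each marginal keeps its own parameter, while the pair (1, 0) can never occur.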


2 Coupling Two Poisson Random Variables 


Let 0 < a < b. Let X and Y be two Poisson random variables with means a and b,
respectively. The mean of X is less than the mean of Y. Can we construct a coupling
of (X, Y) such that $X \le Y$? It turns out that we can and this is what we do now.

Let X be a mean a Poisson random variable. Let V be a mean b - a
Poisson random variable independent of X. An important property of the Poisson
distribution is that a sum of two independent Poisson random variables is also a
Poisson random variable (see the Problems). Let Y = X + V, and then Y is a
Poisson random variable with mean a + (b - a) = b. Moreover, since $V \ge 0$
(why?), $X \le Y$. Hence, we have constructed the vector (X, Y) with the property
$X \le Y$.


3 The Coupling Inequality 


A coupling between random variables X and Y can be used to evaluate how close
the distribution of X is to the distribution of Y. A natural way to compare the two
distributions is to compute

\[ |P(X = k) - P(Y = k)| \]

for every positive integer k and see how large this difference can be. We will actually
use a more stringent metric. Instead of comparing the distributions for individual k's,
we will compare them on sets of integers. More precisely, let D be a set of positive
integers and then we will bound

\[ |P(X \in D) - P(Y \in D)| = \Big|\sum_{k \in D} \big(P(X = k) - P(Y = k)\big)\Big|. \]




The following probability property will be useful. Recall that for any events A
and B,

\[ P(A) = P(A \cap B) + P(A \cap B^c). \]

Hence,

\[ P(A) \le P(B) + P(A \cap B^c), \]

and therefore,

\[ P(A) - P(B) \le P(A \cap B^c). \]

Let D be a set of positive integers, and using the inequality above for

\[ A = \{X \in D\} \quad \text{and} \quad B = \{Y \in D\}, \]

we get

\[ P(X \in D) - P(Y \in D) \le P(X \in D, Y \notin D). \]

Since the event $\{X \in D, Y \notin D\}$ is included in the event $\{X \ne Y\}$,

\[ P(X \in D) - P(Y \in D) \le P(X \ne Y). \]

Similarly,

\[ P(Y \in D) - P(X \in D) \le P(X \ne Y). \]

Therefore,

\[ |P(X \in D) - P(Y \in D)| \le P(X \ne Y). \tag{1} \]

This is the coupling inequality. It gives an upper bound on how far away the
distribution of X is from the distribution of Y.
Note that the inequality holds for any D. It does not matter if we are comparing
the distributions of X and Y on a small set D or on a large set D. The bound
$P(X \ne Y)$ does not depend on D.
We now apply the coupling inequality to two Poisson distributions. 


Example 1 We use the coupling from Sect. 2. Let X be a mean a Poisson random
variable. Let V be a mean b - a Poisson random variable independent of X and
Y = X + V. Then, Y is a Poisson random variable with mean b.

Note that $V \ge 0$. Therefore, $X \ne Y$ if and only if V > 0. Since V is a Poisson
random variable with mean b - a,

\[ P(V > 0) = 1 - e^{-(b-a)}. \]

By the coupling inequality, for any subset D of positive integers,

\[ |P(X \in D) - P(Y \in D)| \le P(X \ne Y) = P(V > 0) = 1 - e^{-(b-a)}. \]

Since for $x \ge 0$, $1 - e^{-x} \le x$,

\[ |P(X \in D) - P(Y \in D)| \le b - a. \]

The inequality above holds for b > a. In the case a > b a similar construction
would yield an upper bound of a - b. Therefore, we can state the following result
that holds for any a and b,

\[ |P(X \in D) - P(Y \in D)| \le |b - a|. \tag{2} \]
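
Inequality (2) can be checked numerically. The sketch below computes the largest value of |P(X in D) - P(Y in D)| over sets D of integers between 0 and a cutoff, using the fact that the maximizing set collects the integers where one probability exceeds the other. The cutoff kmax = 200 and the values of a and b are illustrative assumptions; the neglected tail mass is negligible for means this small.

```python
from math import exp


def poisson_pmf(mean, kmax):
    """Return [P(N = 0), ..., P(N = kmax)] for a Poisson with the given mean."""
    probs = [exp(-mean)]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * mean / k)
    return probs


def max_set_difference(a, b, kmax=200):
    """Largest |P(X in D) - P(Y in D)| over sets D of integers in 0..kmax."""
    pa, pb = poisson_pmf(a, kmax), poisson_pmf(b, kmax)
    # One candidate D collects the k with P(X = k) > P(Y = k), the other
    # collects the k with P(Y = k) > P(X = k).
    positive = sum(x - y for x, y in zip(pa, pb) if x > y)
    negative = sum(y - x for x, y in zip(pa, pb) if y > x)
    return max(positive, negative)


a, b = 1.0, 1.3
print(max_set_difference(a, b), 1 - exp(-(b - a)), b - a)
```

The three printed numbers appear in increasing order: the actual distance, the bound P(V > 0) from the coupling, and the simpler bound b - a.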


4 Poisson Approximation of a Sum 


In the Poisson chapter we have seen that a binomial distribution with parameters 
n and p may be approximated by a Poisson distribution with mean np when p is 
small. In fact, the Poisson approximation of a binomial is a particular case of a more 
general result that will be proved in this section by using a coupling technique. 

Let $n \ge 1$ and $p_1, p_2, \ldots, p_n$ be in (0, 1). Let $X_1, \ldots, X_n$ be independent
Bernoulli random variables with parameters $p_1, \ldots, p_n$, respectively, and

\[ T = \sum_{i=1}^{n} X_i. \]

For $1 \le i \le n$, let $\lambda_i = -\ln(1 - p_i)$, and let W be a Poisson random variable with
mean

\[ \lambda = \sum_{i=1}^{n} \lambda_i. \]
i=l 




• For any subset D of positive integers,

\[ |P(T \in D) - P(W \in D)| \le \frac{1}{2}\sum_{i=1}^{n} \lambda_i^2. \tag{3} \]

In words, the distribution of a sum of independent Bernoulli random variables
can be approximated by a Poisson distribution. This result will be useful when the
upper bound $\frac{1}{2}\sum_{i=1}^{n} \lambda_i^2$ is small.

Before we prove this result, we apply it to the Poisson approximation of a 
binomial. 


4.1 Poisson Approximation of a Binomial 


Let $p_1 = \cdots = p_n = p$; then T is a sum of i.i.d. Bernoulli random variables.
Hence, T is a binomial random variable with parameters n and p. Note that for all
$1 \le i \le n$, $\lambda_i = -\ln(1 - p)$. Therefore the sum $\lambda$ of the $\lambda_i$ is $\lambda = -n\ln(1 - p)$
and W is a mean $\lambda$ Poisson random variable. By (3), we get

\[ |P(T \in D) - P(W \in D)| \le \frac{1}{2}\sum_{i=1}^{n} \lambda_i^2 = \frac{1}{2}n\ln^2(1 - p). \]


We now show that W can be replaced by $W'$, a Poisson random variable with mean
$\lambda' = np$. We add and subtract $P(W \in D)$ to get

\[ |P(T \in D) - P(W' \in D)| = |P(T \in D) - P(W \in D) + P(W \in D) - P(W' \in D)|. \]

By the triangle inequality,

\[ |P(T \in D) - P(W' \in D)| \le |P(T \in D) - P(W \in D)| + |P(W \in D) - P(W' \in D)|. \]

Since

\[ |P(T \in D) - P(W \in D)| \le \frac{1}{2}n\ln^2(1 - p), \]

and by (2)

\[ |P(W \in D) - P(W' \in D)| \le |-n\ln(1 - p) - np| = -n\ln(1 - p) - np, \]

we get

\[ |P(T \in D) - P(W' \in D)| \le \frac{1}{2}n\ln^2(1 - p) - n\ln(1 - p) - np. \]

The r.h.s. provides an exact upper bound on the distance between binomial (n, p)
and mean np Poisson distributions. For instance, for n = 10 and p = 0.05 we get
for any set D of integers

\[ |P(T \in D) - P(W' \in D)| \le 0.026. \]

Next we show that as p approaches 0

\[ \frac{1}{2}n\ln^2(1 - p) - n\ln(1 - p) - np \sim np^2. \]

This provides a simpler expression that is quite close to the exact bound. For
instance, for n = 10 and p = 0.05, $np^2 = 0.025$.
Using the power series expansion

\[ \ln(1 - x) = -x - \frac{1}{2}x^2 - \frac{1}{3}x^3 - \cdots, \]

we see that for p approaching 0,

\[ \frac{1}{2}n\ln^2(1 - p) \sim \frac{1}{2}np^2 \]

and

\[ -n\ln(1 - p) - np \sim \frac{1}{2}np^2. \]

Therefore, as p approaches 0

\[ |P(T \in D) - P(W' \in D)| \le np^2. \tag{4} \]

This proves the classical approximation of a binomial (n, p) by a Poisson with
mean $\lambda' = np$. The error of the approximation is bounded by $\lambda' p$.
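
The numbers quoted for n = 10 and p = 0.05 are easy to reproduce, and the bound can be compared with the actual distance between the two distributions. The sketch below is only illustrative; the function names are hypothetical. Since a binomial (n, p) puts no mass above n, the positive part of the difference of the two probability functions only needs to be summed up to n.

```python
from math import comb, exp, log


def exact_bound(n, p):
    """Right-hand side (1/2) n ln^2(1-p) - n ln(1-p) - n p."""
    return 0.5 * n * log(1 - p) ** 2 - n * log(1 - p) - n * p


def binomial_poisson_gap(n, p):
    """Largest |P(T in D) - P(W' in D)| over sets D of non-negative integers."""
    lam = n * p
    gap = 0.0
    poisson = exp(-lam)           # P(W' = 0)
    for k in range(n + 1):
        binom = comb(n, k) * p ** k * (1 - p) ** (n - k)
        if binom > poisson:       # k belongs to the maximizing set D
            gap += binom - poisson
        poisson *= lam / (k + 1)  # P(W' = k + 1) from P(W' = k)
    return gap


n, p = 10, 0.05
print(exact_bound(n, p))          # about 0.026
print(n * p ** 2)                 # about 0.025
print(binomial_poisson_gap(n, p))
```

The last printed value is smaller than the first: the coupling argument gives a guaranteed bound, not the exact distance.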


5 Proof of the Poisson Approximation 


We now prove (3). We are given $n \ge 1$ and $p_1, p_2, \ldots, p_n$ in (0, 1). We define
for every $1 \le i \le n$, $\lambda_i = -\ln(1 - p_i)$. Let $W_1, \ldots, W_n$ be independent Poisson
random variables with means $\lambda_1, \ldots, \lambda_n$, respectively.

For $1 \le i \le n$:

• Set $X_i = W_i$ if $W_i = 0$ or $W_i = 1$.
• Set $X_i = 1$ if $W_i \ge 2$.

Hence, $X_i$ can only take values 0 or 1. Note that

\[ P(X_i = 0) = P(W_i = 0) = e^{-\lambda_i}. \]

Because of our definition of $\lambda_i$, $e^{-\lambda_i} = 1 - p_i$. That is, $X_i$ is a Bernoulli random
variable with parameter $p_i$. Since $W_1, \ldots, W_n$ are independent, so are $X_1, \ldots, X_n$.
Let

\[ T = \sum_{i=1}^{n} X_i \quad \text{and} \quad W = \sum_{i=1}^{n} W_i. \]

Note that W as a sum of independent Poisson random variables is a Poisson random
variable with mean $\sum_{i=1}^{n} \lambda_i$.
Observe that $T \ne W$ if and only if $X_i \ne W_i$ for some i between 1 and n. Hence,

\[ P(T \ne W) = P\Big(\bigcup_{i=1}^{n} \{X_i \ne W_i\}\Big) \le \sum_{i=1}^{n} P(X_i \ne W_i). \]

Since $X_i \ne W_i$ if and only if $W_i \ge 2$,

\[ P(X_i \ne W_i) = P(W_i \ge 2) = 1 - e^{-\lambda_i} - \lambda_i e^{-\lambda_i} \le \frac{1}{2}\lambda_i^2, \]

where the last inequality comes from $1 - e^{-x} - xe^{-x} \le x^2/2$ for all $x \ge 0$ (see the
Problems). Hence,

\[ P(T \ne W) \le \sum_{i=1}^{n} P(X_i \ne W_i) \le \frac{1}{2}\sum_{i=1}^{n} \lambda_i^2. \]

Using now the coupling inequality (1), we have for any set of integers D,

\[ |P(T \in D) - P(W \in D)| \le P(T \ne W) \le \frac{1}{2}\sum_{i=1}^{n} \lambda_i^2. \]

This completes the proof of (3).
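
The key step of the proof, the bound on the probability that a coupled pair differs, can be evaluated numerically. The sketch below is an illustration under assumed parameter values; the function names are hypothetical and the list ps is an arbitrary example.

```python
from math import exp, log


def mismatch_probability(p):
    """P(X_i != W_i) = P(W_i >= 2) when W_i is Poisson with mean -ln(1 - p)."""
    lam = -log(1 - p)
    return 1 - exp(-lam) - lam * exp(-lam)


def coupling_bound(ps):
    """Upper bound (1/2) * sum of lambda_i^2 from inequality (3)."""
    return 0.5 * sum(log(1 - p) ** 2 for p in ps)


ps = [0.01, 0.02, 0.05, 0.1]      # illustrative parameters
print(sum(mismatch_probability(p) for p in ps), coupling_bound(ps))
```

The first number, the sum of the mismatch probabilities, stays below the second, as the proof requires.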


Problems 


1. Consider $X_1$ and $X_2$ defined in Sect. 1.

(a) Show that $P(X_1 = 1, X_2 = 1) = p_1$.
(b) Show that $P(X_1 = 0, X_2 = 1) = p_2 - p_1$.
(c) Show that the table in Sect. 1 is indeed a probability distribution.

2. Consider $X_1$ and $X_2$ defined in Sect. 1.

(a) Find the covariance of $(X_1, X_2)$.
(b) Are $X_1$ and $X_2$ independent?

3. Let X and Y be two independent Bernoulli random variables with parameters $p_1$
and $p_2$, respectively.

(a) What is the distribution of the vector (X, Y)?

(b) Compute $P(X \ne Y)$.

(c) Let the distribution of $(X_1, X_2)$ be defined by the table in Sect. 1. What is
$P(X_1 \ne X_2)$?

(d) Use (b) and (c) to show that $P(X_1 \ne X_2) \le P(X \ne Y)$.




4. Let X be a mean a Poisson random variable. Let V be a mean b — a Poisson 
random variable independent of X. Let Y = X + V. 


(a) Find the covariance between X and Y. 
(b) Are X and Y independent? 


5. Let B be a binomial random variable with n = 10 and p = 1/10. 


(a) Compute P(B = k) for all integers k. 

(b) Let N be a Poisson random variable with $\lambda = np$. Compute P(N = k) for
every k = 0, 1, ..., 10.

(c) What is the maximum value for $|P(B = k) - P(N = k)|$?

(d) Is (c) consistent with the bound computed in Example 2? That is, is it true
that

\[ |P(B = k) - P(N = k)| \le np^2 \]

for all k?

6. Let $f(x) = 1 - e^{-x} - xe^{-x} - x^2/2$.

(a) Show that f is a decreasing function on $[0, +\infty)$.
(b) Use (a) to show for $x \ge 0$,

\[ 1 - e^{-x} - xe^{-x} \le x^2/2. \]

7. Let T be a binomial random variable with parameters n and p = 1/n. Let W be a mean 1
Poisson random variable.

(a) Show that for any positive integer k,

\[ |P(T = k) - P(W = k)| \le \frac{1}{n}. \]

(b) Show that as $n \to +\infty$, $P(T = k) \sim P(W = k)$.


8. Let $T_1$ be a binomial random variable with parameters $(n_1, p_1)$. Let $T_2$ be a
binomial random variable with parameters $(n_2, p_2)$. Assume that $T_1$ and $T_2$ are
independent.

(a) If we approximate $T_1 + T_2$ using a Poisson W, what should the mean $\lambda$ of W
be?

(b) Show that for any subset D of integers, provided $p_1$ and $p_2$ are close enough
to 0,

\[ |P(T_1 + T_2 \in D) - P(W \in D)| \le n_1 p_1^2 + n_2 p_2^2. \]




9. Assume that X and Y are independent Poisson random variables with means a
and b, respectively. Let $n \ge 0$.

(a) Show that

\[ P(X + Y = n) = \sum_{k=0}^{n} P(X = k)P(Y = n - k). \]

(b) Use (a) to show that

\[ P(X + Y = n) = e^{-(a+b)}\frac{1}{n!}\sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \]

(c) Use the binomial theorem in (b) to get

\[ P(X + Y = n) = e^{-(a+b)}\frac{1}{n!}(a + b)^n. \]

(d) What is the distribution of X + Y?


Chapter 22
The Moment Generating Function


1 Definition and Examples 


In this chapter we introduce moment generating functions. We will show that they 
can be used in multiple applications such as finding the distribution of sums of 
independent random variables and computing limiting distributions. 


• Let X be a random variable. The moment generating function (m.g.f. in short)
of X is defined by

\[ M_X(t) = E(e^{tX}). \]

If X is a discrete random variable with support S, then

\[ M_X(t) = \sum_{k \in S} e^{tk} P(X = k). \]

If X is a continuous random variable with density f and support (a, b), then

\[ M_X(t) = \int_a^b e^{tx} f(x)\,dx. \]

Example 1 Let X be a Bernoulli random variable with parameter p. Compute its
m.g.f.

Since P(X = 0) = 1 - p and P(X = 1) = p,

\[ M_X(t) = P(X = 0) + e^t P(X = 1) = (1 - p) + pe^t. \]


© Springer Nature Switzerland AG 2022 241 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_22 




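Example 2 Assume X is exponentially distributed with rate $\lambda$. Its m.g.f. is

\[ M_X(t) = E(e^{tX}) = \int_0^{\infty} e^{tx}\lambda e^{-\lambda x}\,dx = \int_0^{\infty} \lambda e^{(t-\lambda)x}\,dx = \frac{\lambda}{\lambda - t}. \]

Note that the improper integral above is convergent if and only if $t - \lambda < 0$. Hence,
the m.g.f. of an exponential with rate $\lambda$ is only defined for $t < \lambda$.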


As shown in Example 2 the moment generating function need not be defined for
every t. In fact for many distributions, the m.g.f. is only defined at t = 0 (and
is then useless). To be useful an m.g.f. must be defined on an interval (however
small).

We are now going to state two properties that make moment generating functions
very helpful:

• Assume that $M_X(t) = M_Y(t)$ for all t in an interval around t = 0. Then, X and
Y have the same distribution.

In words, moment generating functions characterize distributions. For the proof
of this property as well as other properties of m.g.f., see Chapter 56 in Port (1994).


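Example 3 Let X be exponentially distributed with rate $\lambda$.

Define $Y = \lambda X$. What is the distribution of Y?
We compute the m.g.f. of Y.

\[ M_Y(t) = E(e^{tY}) = E(e^{t\lambda X}) = M_X(\lambda t). \]

By Example 2, $M_X(t) = \frac{\lambda}{\lambda - t}$ for $t < \lambda$. Hence, for t < 1,

\[ M_Y(t) = M_X(\lambda t) = \frac{\lambda}{\lambda - \lambda t} = \frac{1}{1 - t}. \]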



That is, the m.g.f. of Y is the m.g.f. of a rate 1 exponential random variable. 
Therefore, Y is exponentially distributed with rate 1. 

Another critically important property of moment generating functions is the 
following: 


• Let X and Y be two independent random variables. Let S = X + Y, and then

\[ M_S(t) = M_X(t)M_Y(t). \]

We now prove this property. Note that since X and Y are independent, so are $e^{tX}$
and $e^{tY}$. Hence,

\[ M_S(t) = E(e^{t(X+Y)}) = E(e^{tX}e^{tY}) = E(e^{tX})E(e^{tY}) = M_X(t)M_Y(t). \]


1.1 Sum of i.i.d. Bernoulli Random Variables 


Let B be a binomial random variable with parameters (n, p). Compute its m.g.f.
Recall that B can be written as

\[ B = X_1 + X_2 + \cdots + X_n, \]

where $X_1, X_2, \ldots, X_n$ are i.i.d. Bernoulli random variables with parameter p.
Hence,

\[ E(e^{tB}) = E\big(\exp(t(X_1 + X_2 + \cdots + X_n))\big) = E\big(e^{tX_1}e^{tX_2}\cdots e^{tX_n}\big). \]

By the independence of $e^{tX_1}, e^{tX_2}, \ldots, e^{tX_n}$, we get

\[ E(e^{tB}) = E(e^{tX_1})E(e^{tX_2})\cdots E(e^{tX_n}). \]

Since $X_1, X_2, \ldots, X_n$ all have the same distribution, they have the same m.g.f. By
Example 1 a Bernoulli random variable with parameter p has m.g.f. $1 - p + pe^t$.
Hence,

\[ E(e^{tB}) = (1 - p + pe^t)^n. \]
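
The formula can be compared with a direct evaluation of the m.g.f. from the binomial distribution. This is a minimal sketch; the function names and the numerical values of n, p and t are illustrative assumptions.

```python
from math import comb, exp, isclose


def binomial_mgf_direct(n, p, t):
    """E(e^{tB}) computed from the binomial distribution."""
    return sum(exp(t * k) * comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1))


def binomial_mgf_product(n, p, t):
    """Product of n identical Bernoulli m.g.f.s, (1 - p + p e^t)^n."""
    return (1 - p + p * exp(t)) ** n


n, p, t = 12, 0.3, 0.7
print(binomial_mgf_direct(n, p, t), binomial_mgf_product(n, p, t))
print(isclose(binomial_mgf_direct(n, p, t), binomial_mgf_product(n, p, t)))
```

The two values agree up to floating-point rounding, which is the binomial theorem in disguise.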


1.2. Sum of Independent Poisson Random Variables 


We first compute the m.g.f. of a Poisson random variable. 


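• Let N be a Poisson random variable with mean $\lambda$, and let $M_N$ be its m.g.f. Then,
$M_N(t) = \exp(\lambda(-1 + e^t))$ for all t.

By the definition of the m.g.f.,

\[ M_N(t) = E(e^{tN}) = \sum_{k=0}^{\infty} e^{tk}e^{-\lambda}\frac{\lambda^k}{k!}. \]

Recall that for every real number x, we have the following power series expansion
for the exponential function,

\[ e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}. \]

We use this power series expansion with $x = e^t\lambda$ to get for all t,

\[ M_N(t) = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(e^t\lambda)^k}{k!} = e^{-\lambda}\exp(e^t\lambda) = \exp(\lambda(-1 + e^t)). \]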


We now turn to the sum of Poisson random variables. 


• Let $N_1, N_2, \ldots, N_k$ be independent Poisson random variables with parameters
$\lambda_1, \lambda_2, \ldots, \lambda_k$, respectively. Then, $N_1 + N_2 + \cdots + N_k$ is also a Poisson random
variable and its parameter is $\lambda_1 + \lambda_2 + \cdots + \lambda_k$.

Let

\[ N = N_1 + N_2 + \cdots + N_k. \]

Then,

\[ M_N(t) = E(e^{tN}) = E\big(\exp(t(N_1 + N_2 + \cdots + N_k))\big) = E\big(e^{tN_1}e^{tN_2}\cdots e^{tN_k}\big). \]

By the independence of $e^{tN_1}, e^{tN_2}, \ldots, e^{tN_k}$, we get

\[ E(e^{tN}) = E(e^{tN_1})E(e^{tN_2})\cdots E(e^{tN_k}) = \exp(\lambda_1(-1 + e^t))\exp(\lambda_2(-1 + e^t))\cdots\exp(\lambda_k(-1 + e^t)) = \exp(\lambda(-1 + e^t)), \]

where $\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_k$.
Since the m.g.f. of N is the m.g.f. of a Poisson random variable with parameter
$\lambda$, this proves that N is a Poisson random variable with parameter $\lambda$.
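
Both facts can be checked numerically: the closed form of the Poisson m.g.f. against a truncated version of the defining series, and the product of two Poisson m.g.f.s against the m.g.f. with the summed parameter. The sketch below is illustrative; the function names, the truncation at 200 terms, and the parameter values are assumptions that are adequate only for moderate values of the mean and of t.

```python
from math import exp, isclose


def poisson_mgf_series(lam, t, terms=200):
    """Approximate E(e^{tN}) by summing e^{tk} P(N = k) for k < terms."""
    total = 0.0
    term = exp(-lam)              # e^{t*0} P(N = 0)
    ratio = lam * exp(t)
    for k in range(terms):
        total += term
        term *= ratio / (k + 1)   # e^{t(k+1)} P(N = k+1) from e^{tk} P(N = k)
    return total


def poisson_mgf(lam, t):
    """Closed form exp(lam * (e^t - 1))."""
    return exp(lam * (exp(t) - 1))


print(poisson_mgf_series(2.0, 0.5), poisson_mgf(2.0, 0.5))
# Product of m.g.f.s with means 1 and 2.5 versus the m.g.f. with mean 3.5
print(poisson_mgf(1.0, 0.3) * poisson_mgf(2.5, 0.3), poisson_mgf(3.5, 0.3))
```

Each printed pair agrees up to rounding, matching the derivation above.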


Problems 


1. Compute the moment generating function of a geometric random variable with 
parameter p. 

2. Compute the m.g.f. of a uniform random variable on (0,1). 

3. Let X and Y be independent binomial random variables with parameters n and
$p_1$, and m and $p_2$, respectively.

(a) Show that if $p_1 = p_2$, then X + Y is also a binomial random variable.
(b) Show that if $p_1 \ne p_2$, then X + Y is not a binomial random variable.

4. Let X and Y be two independent geometric random variables with the same
parameter p. That is,

\[ P(X = k) = P(Y = k) = p(1 - p)^{k-1} \quad \text{for } k = 1, 2, \ldots \]

Show that

\[ P(X + Y = n) = (n - 1)p^2(1 - p)^{n-2} \quad \text{for } n = 2, 3, \ldots \]

5. Let X be a random variable with m.g.f. $M_X$. Let Y = aX + b, where a and b are real
numbers. Show that

\[ M_Y(t) = e^{tb}M_X(at). \]


2 The m.g.f. of a Normal


We will apply m.g.f. to the normal distribution to prove a number of important
properties. We start by computing the m.g.f. of a standard normal random variable.

• Let Z be a standard normal random variable. Then the m.g.f. of Z is for all t,

\[ M_Z(t) = \exp\Big(\frac{1}{2}t^2\Big). \]

We now prove the formula.

\[ M_Z(t) = E(e^{tZ}) = \int_{-\infty}^{\infty} e^{tz}\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}e^{tz - z^2/2}\,dz. \]

We complete the square to get

\[ zt - z^2/2 = -(z - t)^2/2 + t^2/2. \]

Thus,

\[ M_Z(t) = e^{t^2/2}\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-(z-t)^2/2}\,dz. \]

Note that $g(z) = \frac{1}{\sqrt{2\pi}}e^{-(z-t)^2/2}$ is the density of a normal distribution with mean t
and standard deviation 1. Therefore,

\[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-(z-t)^2/2}\,dz = 1. \]

Thus,

\[ M_Z(t) = e^{t^2/2}. \]

We use the preceding formula to get the m.g.f. for any normal random variable. 


• A normal random variable X with mean $\mu$ and variance $\sigma^2$ has moment
generating function

\[ M_X(t) = \exp\Big(\mu t + \frac{1}{2}\sigma^2 t^2\Big). \]

First recall that

\[ Z = \frac{X - \mu}{\sigma} \]

is a standard normal random variable. Since $X = \mu + \sigma Z$,

\[ M_X(t) = E(\exp(tX)) = E(\exp(t\mu + t\sigma Z)) = \exp(t\mu)E(\exp(t\sigma Z)) = \exp(t\mu)M_Z(t\sigma). \]

Since $M_Z(t) = e^{t^2/2}$,

\[ M_X(t) = \exp\Big(\mu t + \frac{1}{2}\sigma^2 t^2\Big). \]


A first application of the m.g.f. is the following. 


• Let X be normally distributed with mean $\mu_1$ and variance $\sigma_1^2$. Let Y be normally
distributed with mean $\mu_2$ and variance $\sigma_2^2$. Assume that X and Y are independent.
Then, aX + bY is normally distributed with mean $a\mu_1 + b\mu_2$ and variance
$a^2\sigma_1^2 + b^2\sigma_2^2$.

We find the distribution of aX + bY by using its m.g.f. Let V = aX + bY.

\[ M_V(t) = E(e^{tV}) = E(e^{taX}e^{tbY}). \]

Since X and Y are independent, so are $e^{taX}$ and $e^{tbY}$. Hence,

\[ M_V(t) = E(e^{taX})E(e^{tbY}). \]

Note now that $E(e^{taX}) = M_X(at)$ and $E(e^{tbY}) = M_Y(bt)$. Thus,

\[ M_V(t) = M_X(at)M_Y(bt). \]

Use now the m.g.f. of X and Y,

\[ M_V(t) = \exp\Big(\mu_1 at + \frac{1}{2}\sigma_1^2(at)^2\Big)\exp\Big(\mu_2 bt + \frac{1}{2}\sigma_2^2(bt)^2\Big) = \exp\Big((a\mu_1 + b\mu_2)t + \frac{1}{2}t^2(a^2\sigma_1^2 + b^2\sigma_2^2)\Big). \]

That is,

\[ M_V(t) = \exp\Big(\mu t + \frac{1}{2}\sigma^2 t^2\Big), \]

where $\mu = a\mu_1 + b\mu_2$ and $\sigma^2 = a^2\sigma_1^2 + b^2\sigma_2^2$. Since the m.g.f. of V is the m.g.f. of
a normal with mean $\mu$ and variance $\sigma^2$, V is a normal random variable with mean $\mu$
and variance $\sigma^2$.


Example 4 Assume that X is a normal random variable with mean 1 and variance
4. Assume that Y is a normal random variable with mean 1 and variance 1. Assume
that X and Y are independent. What is P(Y > 2X)?

Note that P(Y > 2X) = P(V > 0) where V = Y - 2X. Then, V is normally
distributed with mean $1 - 2 \times 1 = -1$ and variance

\[ 1^2 \times 1 + (-2)^2 \times 4 = 17. \]

Hence,

\[ P(Y > 2X) = P(V > 0) = P\Big(\frac{V + 1}{\sqrt{17}} > \frac{1}{\sqrt{17}}\Big) = P\Big(Z > \frac{1}{\sqrt{17}}\Big), \]

where Z is a standard normal random variable. By the normal table,

\[ P(Y > 2X) \approx 0.41. \]


3 Moment Computations 


Moment generating functions get their name from the following property: 


• Let X be a random variable and $k \ge 1$ be an integer. The expectation $E(X^k)$ is
called the kth moment of X. If X has a moment generating function $M_X$ defined
on some interval around 0, then all the moments of X exist. Moreover,

\[ E(X^k) = M_X^{(k)}(0), \]

where $M_X^{(k)}$ designates the kth derivative of $M_X$.


Example 5 We use the formula above to compute the first two moments of the
Poisson distribution. Let N be a Poisson random variable with mean $\lambda$. Then $M_N$ is
defined everywhere and

\[ M_N(t) = \exp(\lambda(-1 + e^t)). \]

Note that the first derivative is

\[ M_N'(t) = \lambda e^t\exp(\lambda(-1 + e^t)). \]

Letting t = 0 in the formula above yields $M_N'(0) = \lambda$. Therefore, $E(N) = \lambda$.
We now compute the second derivative

\[ M_N''(t) = \lambda e^t\exp(\lambda(-1 + e^t)) + \lambda^2 e^{2t}\exp(\lambda(-1 + e^t)). \]

Thus,

\[ M_N''(0) = \lambda + \lambda^2. \]

Hence, $E(N^2) = \lambda + \lambda^2$. Since we have the first two moments, we can compute the
variance of N by

\[ \mathrm{Var}(N) = E(N^2) - E(N)^2 = \lambda. \]


Example 6 Let X be an exponential random variable with parameter a > 0. Recall
that $M_X$ is defined for t < a by

\[ M_X(t) = \frac{a}{a - t}. \]

Hence,

\[ M_X'(t) = \frac{a}{(a - t)^2}. \]

Letting t = 0, we get E(X) = 1/a. The second derivative is

\[ M_X''(t) = 2a(a - t)^{-3}. \]

Therefore, $E(X^2) = 2/a^2$. Thus, the variance is

\[ \mathrm{Var}(X) = \frac{2}{a^2} - \Big(\frac{1}{a}\Big)^2 = \frac{1}{a^2}. \]

We now turn to a method that may provide all the moments at once.


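Example 7 Let X be an exponential random variable with parameter a > 0. Find
$E(X^k)$ for all $k \ge 0$.

Recall the geometric power series. For |x| < 1,

\[ \frac{1}{1 - x} = \sum_{k=0}^{\infty} x^k. \]

Let |t| < a. We are going to use the geometric power series for x = t/a.

\[ M_X(t) = \frac{a}{a - t} = \frac{1}{1 - t/a} = \sum_{k=0}^{\infty} \frac{t^k}{a^k}. \]

Recall from Calculus that if $M_X$ has a power series expansion, then

\[ M_X(t) = \sum_{k=0}^{\infty} \frac{M_X^{(k)}(0)}{k!}t^k. \]

Hence,

\[ \sum_{k=0}^{\infty} \frac{M_X^{(k)}(0)}{k!}t^k = \sum_{k=0}^{\infty} \frac{t^k}{a^k}. \]

Since power series expansions are unique, for every $k \ge 0$,

\[ \frac{M_X^{(k)}(0)}{k!} = \frac{1}{a^k}. \]

Therefore,

\[ M_X^{(k)}(0) = \frac{k!}{a^k}. \]

Thus, for all $k \ge 0$,

\[ E(X^k) = \frac{k!}{a^k}. \]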
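
The moment formula for the exponential distribution can be confirmed by integrating the density numerically. The sketch below uses a plain midpoint rule; the function name, the rate a = 2, the cutoff 60/a and the number of steps are all illustrative assumptions rather than anything prescribed by the text.

```python
from math import exp, factorial, isclose


def exponential_moment(a, k, steps=200_000):
    """Approximate E(X^k) for X exponential with rate a by the midpoint rule."""
    upper = 60.0 / a              # the density is negligible beyond this point
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** k * a * exp(-a * x)
    return total * h


a = 2.0                           # illustrative rate
for k in range(5):
    print(k, exponential_moment(a, k), factorial(k) / a ** k)
```

For each k the numerical integral and k!/a^k agree to several digits, in line with the power series argument.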


Problems


6. Assume that X is a normal random variable with mean 1 and variance 1.
Assume that Y is a normal random variable with mean 2 and variance 1.
Assume that X and Y are independent.

(a) What is P(Y > X)?
(b) What is P(Y < 3X)?

7. Use the m.g.f. technique to show that if X is normally distributed with mean $\mu$
and variance $\sigma^2$, then

\[ Y = aX + b \]

is also normally distributed. With what parameters?

8. Assume that in a population heights are normally distributed. The mean height
for men is 172 cm with SD 5 cm and for women the mean is 165 cm with SD
3 cm. What is the probability that a woman taken at random is taller than a man
taken at random?

9. Compute the first 3 moments of a binomial random variable by taking
derivatives of its m.g.f.

10. Compute the first two moments of a geometric random variable by taking
derivatives of its m.g.f.

11. (a) Compute the first two moments of a uniform random variable U on (0,1)
by taking derivatives of its m.g.f.

(b) Show that the m.g.f. of U has the following power series expansion for all
t,

\[ M_U(t) = \sum_{k=1}^{\infty} \frac{t^{k-1}}{k!}. \]

(c) Use the method of Example 7 to find all the moments of U.

12. Compute the fourth moment of a standard normal distribution Z.

13. What is the m.g.f. of a normal distribution with mean 1 and standard deviation
2?

14. Use the m.g.f. to compute the first two moments of a normal distribution with
mean $\mu$ and standard deviation $\sigma$.

15. Let Z be a standard normal distribution.

(a) Show that the m.g.f. of Z has the following power series expansion for all
t,

\[ M_Z(t) = \sum_{k=0}^{\infty} \frac{t^{2k}}{2^k k!}. \]

(b) Show that for all integers $k \ge 1$,

\[ E(Z^{2k}) = \frac{(2k)!}{2^k k!}. \]

(c) Show that (b) simplifies to

\[ E(Z^{2k}) = 1 \times 3 \times 5 \times \cdots \times (2k - 1). \]

(d) Compute numerically the first 10 moments of Z.


4 Convergence in Distribution 


Our final application of moment generating functions regards the convergence of 
sequences of random variables. Our main tool will be the following theorem: 


• Consider a sequence of random variables $X_1, X_2, \dots$ and their corresponding
moment generating functions $M_1, M_2, \dots$. Assume that there is $r > 0$ such that
for every $t$ in $(-r, r)$

\[ \lim_{n\to\infty} M_n(t) = M(t), \]

where $M$ is the moment generating function of some random variable $X$. Then,
the distribution of $X_n$ approaches the distribution of $X$ as $n$ goes to infinity.


The proof of this theorem is beyond the scope of this book. We will apply it to 
two interesting cases. 


4.1 Binomial Convergence to a Poisson 


Consider a sequence of binomial random variables $X_n$ for $n \ge 1$. Each $X_n$ is
a binomial with parameters $(n, p_n)$ where $p_n$ is a sequence of strictly positive
numbers. Assume also that $np_n$ converges to some $\lambda > 0$ as $n$ goes to infinity. We
will show that $X_n$ converges in distribution to a mean $\lambda$ Poisson random variable.
For $n \ge 1$ the m.g.f. of $X_n$ is

\[ M_n(t) = (1 - p_n + p_n e^t)^n \]

and so

\[ \ln M_n(t) = n \ln(1 - p_n + p_n e^t). \]

Observe that

\[ p_n = \frac{np_n}{n} \]

and since $np_n$ converges to $\lambda$, we see that $p_n$ converges to 0. Hence, for fixed $t$,
$-p_n + p_n e^t$ converges to 0 as well. We multiply and divide by $-p_n + p_n e^t$ to get

\[ \ln M_n(t) = \frac{\ln(1 - p_n + p_n e^t)}{-p_n + p_n e^t}\; n(-p_n + p_n e^t). \]

Now, recall from Calculus that

\[ \lim_{x \to 0} \frac{\ln(1+x)}{x} = 1. \]

In particular if $x_n$ is a non-zero sequence converging to 0, we have

\[ \lim_{n\to\infty} \frac{\ln(1+x_n)}{x_n} = 1. \]

We apply this result to $x_n = -p_n + p_n e^t$ to get

\[ \lim_{n\to\infty} \frac{\ln(1 - p_n + p_n e^t)}{-p_n + p_n e^t} = 1. \]

Observe also that

\[ n(-p_n + p_n e^t) = np_n(-1 + e^t) \]

converges to $\lambda(-1 + e^t)$. Hence,

\[ \ln M_n(t) \text{ converges to } \lambda(-1 + e^t) \]

and

\[ M_n(t) \text{ converges to } \exp(\lambda(-1 + e^t)). \]

Since $\exp(\lambda(-1 + e^t))$ is the m.g.f. of a mean $\lambda$ Poisson random variable, we
have proved that the sequence $X_n$ converges in distribution to a mean $\lambda$ Poisson
distribution.
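The convergence can also be observed numerically. The following is a minimal sketch, not part of the original text: it takes $p_n = \lambda/n$ (one possible choice satisfying the hypothesis) and compares the binomial m.g.f. and probabilities with their Poisson limits. The function names and the cutoff `kmax` are illustrative assumptions.

```python
import math

def binomial_mgf(t, n, p):
    # M_n(t) = (1 - p + p e^t)^n
    return (1 - p + p * math.exp(t)) ** n

def poisson_mgf(t, lam):
    # m.g.f. of a mean-lam Poisson random variable
    return math.exp(lam * (math.exp(t) - 1))

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, lam, kmax=20):
    # Largest pointwise difference between the two distributions on 0..kmax,
    # with the assumed choice p_n = lam / n.
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(kmax + 1))

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, binomial_mgf(0.5, n, 2.0 / n), max_pmf_gap(n, 2.0))
    print("limit m.g.f.:", poisson_mgf(0.5, 2.0))
```

Both the gap between the m.g.f.s and the gap between the probabilities shrink as $n$ grows.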


4.2 Proof of the Central Limit Theorem 


The Central Limit Theorem requires only that the variance exists. We will sketch 
the proof of the Central Limit Theorem with the much more restrictive hypothesis 
that the moment generating function exists. 

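Let $X_1, X_2, \dots, X_n$ be a sequence of independent identically distributed random
variables with mean $\mu$ and variance $\sigma^2$. Assume that there is $r > 0$ such that each
$X_i$ has an m.g.f. defined on $(-r, r)$. Let

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}. \]

We want to show that the distribution of

\[ T_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \]

approaches the distribution of a standard normal distribution. We start by computing
the moment generating function of $T_n$. For every $n$ we denote the m.g.f. of $T_n$ by
$M_n$. By the definition of the m.g.f.

\[ M_n(t) = E(e^{tT_n}) = E\left(\exp\left(t\,\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}\right)\right) = E\left(\exp\left(t\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}\right)\right). \]

Observe now that

\[ \frac{\bar{X}_n - \mu}{\sigma} = \frac{1}{n}\sum_{i=1}^{n} \frac{X_i - \mu}{\sigma}. \]

Let $Y_i = \frac{X_i - \mu}{\sigma}$. Then,

\[ M_n(t) = E\left(\exp\left(t\,\frac{\sqrt{n}}{n}\sum_{i=1}^{n} Y_i\right)\right) = E\left(\exp\left(\frac{t}{\sqrt{n}}\sum_{i=1}^{n} Y_i\right)\right). \]

Since the $Y_i$ are independent and have the same distribution,

\[ M_n(t) = E\left(\exp\left(\tfrac{t}{\sqrt{n}} Y_1\right)\right) E\left(\exp\left(\tfrac{t}{\sqrt{n}} Y_2\right)\right) \cdots E\left(\exp\left(\tfrac{t}{\sqrt{n}} Y_n\right)\right) \]
\[ = M_{Y_1}\left(\tfrac{t}{\sqrt{n}}\right) M_{Y_2}\left(\tfrac{t}{\sqrt{n}}\right) \cdots M_{Y_n}\left(\tfrac{t}{\sqrt{n}}\right) = \left(M_Y\left(\tfrac{t}{\sqrt{n}}\right)\right)^n, \]

where $M_Y$ denotes the common m.g.f. of the $Y_i$.

We now write a third degree Taylor expansion for $M_Y$.

\[ M_Y\left(\frac{t}{\sqrt{n}}\right) = M_Y(0) + \frac{t}{\sqrt{n}} M_Y'(0) + \frac{t^2}{2n} M_Y''(0) + \frac{t^3}{6n^{3/2}} M_Y'''(s_n) \]

for some $s_n$ in $(0, \frac{t}{\sqrt{n}})$. Since the $Y_i$ are standardized,

\[ M_Y'(0) = E(Y) = 0 \text{ and } M_Y''(0) = E(Y^2) = Var(Y) = 1. \]

We also know that $M_Y(0) = 1$. This is true for any random variable. Thus,

\[ M_Y\left(\frac{t}{\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + \frac{t^3}{6n^{3/2}} M_Y'''(s_n). \]

Hence,

\[ \ln(M_n(t)) = n \ln\left(M_Y\left(\frac{t}{\sqrt{n}}\right)\right) = n \ln\left(1 + \frac{t^2}{2n} + \frac{t^3}{6n^{3/2}} M_Y'''(s_n)\right). \]

Let $x_n = \frac{t^2}{2n} + \frac{t^3}{6n^{3/2}} M_Y'''(s_n)$. Then,

\[ n \ln\left(M_Y\left(\frac{t}{\sqrt{n}}\right)\right) = n \ln(1 + x_n) = n x_n\, \frac{\ln(1 + x_n)}{x_n}. \]

Since $s_n$ converges to 0, $M_Y'''(s_n)$ converges to $M_Y'''(0)$. Thus,

\[ \lim_{n\to\infty} n x_n = t^2/2. \]

Since $x_n$ converges to 0 as $n$ goes to infinity,

\[ \lim_{n\to\infty} \frac{\ln(1 + x_n)}{x_n} = 1. \]

Therefore,

\[ \lim_{n\to\infty} \ln(M_n(t)) = \lim_{n\to\infty} n \ln\left(M_Y\left(\frac{t}{\sqrt{n}}\right)\right) = \lim_{n\to\infty} n x_n\, \frac{\ln(1 + x_n)}{x_n} = \frac{t^2}{2}. \]

Hence,

\[ \lim_{n\to\infty} M_n(t) = e^{t^2/2}. \]

That is, the sequence of moment generating functions of the sequence

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \]

converges (as $n$ goes to infinity) to the moment generating function of a standard
normal distribution. This is enough to prove that the distribution of

\[ \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \]

converges to the distribution of a standard normal random variable. This concludes
the proof of the Central Limit Theorem.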
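As a numerical illustration of the limit $\ln M_n(t) \to t^2/2$, here is a small sketch under one assumed example that is not taken from the text: $X$ exponential with parameter 1, so that $Y = X - 1$ is standardized and $M_Y(s) = e^{-s}/(1-s)$ for $s < 1$. The function names are illustrative.

```python
import math

def log_mgf_y(s):
    # ln M_Y(s) for Y = X - 1 with X exponential(1) (assumed example); needs s < 1.
    return -s - math.log1p(-s)

def log_mgf_standardized_mean(t, n):
    # ln M_n(t) = n ln M_Y(t / sqrt(n)), as in the proof.
    return n * log_mgf_y(t / math.sqrt(n))

if __name__ == "__main__":
    t = 1.0
    for n in (100, 10_000, 1_000_000):
        print(n, log_mgf_standardized_mean(t, n))
    print("limit t^2/2 =", t * t / 2)
```

The printed values decrease towards $t^2/2 = 0.5$; the remaining gap is of order $1/\sqrt{n}$, which matches the third-order term of the Taylor expansion.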


Problems 


16. Let $X_n$ be a sequence of geometric random variables. Each $X_n$ has a parameter
$p_n$. Assume that $p_n$ is a strictly positive sequence converging to 0. Let $M_n$ be
the m.g.f. of $p_n X_n$.

(a) Show that

\[ M_n(t) = \frac{p_n e^{tp_n}}{1 - (1 - p_n)e^{tp_n}}. \]

(b) Show that for every $t$

\[ \lim_{n\to\infty} M_n(t) = \frac{1}{1-t}. \]

(c) What is the limiting distribution of $p_n X_n$ as $n$ goes to infinity?

17. Let $X_n$ be a sequence of Poisson random variables. For each $n$, $X_n$ has mean $n$.
We are going to show that

\[ Y_n = \frac{X_n - n}{\sqrt{n}} \]

converges in distribution to a standard normal. Let $M_n$ be the m.g.f. of $Y_n$.

(a) Show that

\[ \ln(M_n(t)) = -t\sqrt{n} + n(-1 + e^{t/\sqrt{n}}). \]

(b) Show that for every $t$

\[ \lim_{n\to\infty} \ln(M_n(t)) = \frac{t^2}{2}, \]

and conclude.


Chapter 23
Chi-Squared, Student, and F Distributions


1 The m.g.f. of a Gamma Random Variable


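• The moment generating function of a Gamma random variable with parameters
$r > 0$ and $\lambda > 0$ is

\[ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r \quad \text{for all } t < \lambda. \]

We now compute this formula. Recall that the density of a Gamma random
variable with parameters $r$ and $\lambda$ is for $x > 0$,

\[ f(x) = \frac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x}. \]

Hence,

\[ M_X(t) = E(e^{tX}) = \int_0^\infty e^{tx}\, \frac{\lambda^r}{\Gamma(r)}\, x^{r-1} e^{-\lambda x}\,dx = \frac{\lambda^r}{\Gamma(r)} \int_0^\infty x^{r-1} e^{-(\lambda - t)x}\,dx. \]

Note that the preceding improper integral converges only for $t < \lambda$.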


© Springer Nature Switzerland AG 2022 259 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_23 


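By dividing and multiplying by $(\lambda - t)^r$,

\[ M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r \int_0^\infty \frac{(\lambda - t)^r}{\Gamma(r)}\, x^{r-1} e^{-(\lambda - t)x}\,dx = \left(\frac{\lambda}{\lambda - t}\right)^r, \]

where we used that

\[ g(x) = \frac{(\lambda - t)^r}{\Gamma(r)}\, x^{r-1} e^{-(\lambda - t)x} \]

is the density of a Gamma random variable with parameters $r$ and $\lambda - t$ and therefore

\[ \int_0^\infty g(x)\,dx = 1. \]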
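The closed form can be checked against a direct numerical evaluation of $E(e^{tX})$. The sketch below is an illustration only: it uses a plain midpoint rule, and the truncation point `upper`, the number of steps, and the parameter values are assumptions chosen for this example.

```python
import math

def gamma_density(x, r, lam):
    # Density of a Gamma random variable with parameters r and lam, for x > 0.
    return lam**r / math.gamma(r) * x ** (r - 1) * math.exp(-lam * x)

def gamma_mgf_numeric(t, r, lam, upper=80.0, steps=200_000):
    # Midpoint-rule approximation of the integral of e^{tx} f(x) over (0, upper).
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(t * x) * gamma_density(x, r, lam)
    return total * h

def gamma_mgf_formula(t, r, lam):
    # (lam / (lam - t))^r, valid for t < lam.
    return (lam / (lam - t)) ** r

if __name__ == "__main__":
    print(gamma_mgf_numeric(0.5, 2.5, 2.0), gamma_mgf_formula(0.5, 2.5, 2.0))
```

The two printed numbers should agree to several decimal places as long as $t < \lambda$; for $t \ge \lambda$ the integrand no longer decays and the numerical sum grows with `upper`.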


1.1 Sum of i.i.d. Exponential Random Variables 


• Let $k \ge 1$ and $T_1, T_2, \dots, T_k$ be independent exponential random variables with
the same parameter $a > 0$. Then, $T_1 + T_2 + \cdots + T_k$ is a Gamma random variable
with parameters $k$ and $a$.

To prove our claim we use moment generating functions. Let

\[ V = T_1 + T_2 + \cdots + T_k. \]

Then,

\[ M_V(s) = E(e^{sV}) = E\left(\exp(s(T_1 + T_2 + \cdots + T_k))\right) = E\left(e^{sT_1} e^{sT_2} \cdots e^{sT_k}\right). \]

Recall that

\[ E(e^{sT_i}) = \frac{a}{a-s}. \]

By the independence of $e^{sT_1}, e^{sT_2}, \dots, e^{sT_k}$, we get for $s < a$

\[ M_V(s) = E(e^{sT_1})E(e^{sT_2}) \cdots E(e^{sT_k}) = \frac{a}{a-s}\,\frac{a}{a-s} \cdots \frac{a}{a-s} = \left(\frac{a}{a-s}\right)^k. \]

Since the m.g.f. of V is the m.g.f. of a Gamma with parameters k and a, we can
conclude that the distribution of V is a Gamma with parameters k and a.


Problems 
1. Let $X_1, X_2, \dots, X_n$ be a sequence of independent Gamma random variables with
parameters $(r_1, \lambda), (r_2, \lambda), \dots, (r_n, \lambda)$. That is, they all have the same parameter
$\lambda$ but have possibly different parameters $r_i$. Show that

\[ X_1 + X_2 + \cdots + X_n \]

is also a Gamma random variable. With what parameters?

2. Let X be a Gamma random variable with parameters $(r, \lambda)$. Let $Y = \lambda X$. Show
that Y is also a Gamma random variable. With what parameters?

3. Let X be a Gamma random variable with parameters $r > 0$ and $\lambda > 0$. Use the
m.g.f. to show that

\[ E(X) = \frac{r}{\lambda} \text{ and } Var(X) = \frac{r}{\lambda^2}. \]

4. For $y$ in (0, 1), let

\[ f(y) = \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\,(1-y)^{s-1} y^{r-1}. \]

Recall that $f$ is the probability density of a Beta distribution with parameters
$(r, s)$. Let X and Y be two independent Gamma random variables with parameters
$(r, 1)$ and $(s, 1)$, respectively.

(a) Show that the probability density $h$ of $T = X + Y$ is for $t > 0$,

\[ h(t) = \frac{1}{\Gamma(r)\Gamma(s)} \exp(-t) \int_0^t (t-y)^{s-1} y^{r-1}\,dy. \]

(b) Show that for all $t > 0$,

\[ h(t) = \frac{1}{\Gamma(r+s)}\, t^{r+s-1} \exp(-t). \]

(By Problem 1, T is a Gamma random variable with parameters $(r + s, 1)$.)
(c) Let $t = 1$ in the formulas in (a) and (b) to get

\[ \int_0^1 (1-y)^{s-1} y^{r-1}\,dy = \frac{\Gamma(r)\Gamma(s)}{\Gamma(r+s)}. \]

(d) Show that $f$ is indeed a probability density.
(e) Use that $\Gamma(n) = (n-1)!$ for all integers $n \ge 1$ to compute


1 
/ (1 — y)>y®dy. 
0 


2 The Chi-Squared Distribution 


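• A Gamma random variable with parameters $r = n/2$ and $\lambda = 1/2$ is also called
a Chi-squared random variable with $n$ degrees of freedom (d.f. in short).

The Chi-squared distribution is closely related to the normal distribution as the
next result shows.

• Let $n \ge 1$ and $Z_1, Z_2, \dots, Z_n$ be independent standard normal random variables.
Let

\[ T = Z_1^2 + Z_2^2 + \cdots + Z_n^2. \]

Then T is a Chi-squared random variable with $n$ d.f.

In order to find the distribution of T, we use the m.g.f. We proceed in two steps.

First Step We compute the m.g.f. of $Z^2$, where Z is a standard normal random
variable. Using the density of a standard normal

\[ E(\exp(tZ^2)) = \int_{-\infty}^{+\infty} \exp(tz^2)\, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\left(\tfrac{1}{2} - t\right) z^2\right) dz. \]

Recall that a normal with mean 0 and variance $\sigma^2$ has density
$\frac{1}{\sigma\sqrt{2\pi}} \exp(-\frac{z^2}{2\sigma^2})$. Hence,

\[ \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left(-\frac{z^2}{2\sigma^2}\right) dz = 1. \]

Thus,

\[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left(-\frac{z^2}{2\sigma^2}\right) dz = \sigma. \]

For $t < 1/2$ we set $(1/2 - t) = 1/(2\sigma^2)$, that is, $\sigma = (1 - 2t)^{-1/2}$. Then,

\[ E(\exp(tZ^2)) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left(-\frac{z^2}{2\sigma^2}\right) dz = \sigma = (1 - 2t)^{-1/2}. \]

Note that

\[ (1 - 2t)^{-1/2} = \left(\frac{1/2}{1/2 - t}\right)^{1/2}. \]

Hence, $Z^2$ has the m.g.f. of a Gamma with $r = 1/2$ and $\lambda = 1/2$. Therefore, $Z^2$ is
a Chi-squared random variable with 1 d.f.

Second Step We now compute the m.g.f. of T. Using the independence of
$Z_1, \dots, Z_n$, for $s < 1/2$,

\[ M_T(s) = E(e^{sT}) = E\left(\exp(sZ_1^2) \exp(sZ_2^2) \cdots \exp(sZ_n^2)\right) \]
\[ = E(\exp(sZ_1^2))\, E(\exp(sZ_2^2)) \cdots E(\exp(sZ_n^2)) = \left(\left(\frac{1/2}{1/2 - s}\right)^{1/2}\right)^n = \left(\frac{1/2}{1/2 - s}\right)^{n/2}. \]

This is the m.g.f. of a Gamma with parameters $r = n/2$ and $\lambda = 1/2$. Thus, T is a
Chi-squared random variable with $n$ d.f.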
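The First Step can be sanity-checked by integrating numerically. This is a sketch under stated assumptions (a midpoint rule on a truncated interval; the function name, `half_width`, and `steps` are illustrative choices), comparing the integral with $(1 - 2t)^{-1/2}$.

```python
import math

def mgf_z_squared_numeric(t, half_width=40.0, steps=400_000):
    # Midpoint-rule approximation of E(exp(t Z^2)) for a standard normal Z, t < 1/2.
    # The integrand is exp(-(1/2 - t) z^2) / sqrt(2 pi).
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * h
        total += math.exp(-(0.5 - t) * z * z)
    return total * h / math.sqrt(2 * math.pi)

def mgf_chi_squared(s, n):
    # m.g.f. of a Chi-squared with n d.f.: (1 - 2s)^(-n/2), s < 1/2.
    return (1 - 2 * s) ** (-n / 2)

if __name__ == "__main__":
    for t in (-1.0, 0.0, 0.25):
        print(t, mgf_z_squared_numeric(t), mgf_chi_squared(t, 1))
```

For each $t < 1/2$ the two columns should agree closely; raising the one-variable value to the power $n$ gives the m.g.f. of the sum of $n$ independent squares, as in the Second Step.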


264 23 Chi-Squared, Student, and F Distributions 


3 The Student Distribution 


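• Let Z be a standard normal random variable and X be a Chi-squared random
variable with $r$ degrees of freedom. Assume that X and Z are independent. Then,

\[ T = \frac{Z}{\sqrt{X/r}} \]

is called a Student random variable with $r$ degrees of freedom. Its density is for
all $t$,

\[ f(t) = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\Gamma(r/2)\sqrt{\pi r}}\, (1 + t^2/r)^{-(r+1)/2}. \]

As $r$ increases to infinity, the density of a Student with $r$ degrees of freedom
approaches the density of a standard normal distribution, see the problems.
We now compute the density of a Student with $r$ degrees of freedom. Let

\[ T = \frac{Z}{\sqrt{X/r}} \text{ and } U = X. \]

We invert the relations above to get

\[ Z = T\sqrt{U/r} \text{ and } X = U. \]

The Jacobian of the transformation above is

\[ \begin{vmatrix} \sqrt{u/r} & \dfrac{t}{2\sqrt{ur}} \\ 0 & 1 \end{vmatrix} = \sqrt{u/r}. \]

Since Z and X are independent, the joint density of (Z, X) is for all $z$ and for $x > 0$

\[ \frac{1}{\sqrt{2\pi}\, 2^{r/2}\, \Gamma(r/2)}\, e^{-z^2/2}\, x^{r/2-1} e^{-x/2}. \]

The joint density of (T, U) is then for all $t$ and for $u > 0$

\[ \frac{1}{\sqrt{2\pi}\, 2^{r/2}\, \Gamma(r/2)}\, e^{-t^2 u/(2r)}\, u^{r/2-1} e^{-u/2} \sqrt{u/r}. \]

In order to get the density of T, we integrate the joint density of (T, U) in $u$,

\[ f_T(t) = \frac{1}{\sqrt{2\pi r}\, 2^{r/2}\, \Gamma(r/2)} \int_0^\infty u^{\frac{r+1}{2}-1} e^{-u\left(\frac{t^2}{2r} + 1/2\right)}\,du. \]

To compute the preceding integral, we use a Gamma density in the following way.
We know that

\[ \int_0^\infty \frac{\lambda^s}{\Gamma(s)}\, x^{s-1} e^{-\lambda x}\,dx = 1 \]

for all $s > 0$ and $\lambda > 0$. This is so because the integrand above is the density of a
Gamma random variable with parameters $s$ and $\lambda$. Therefore,

\[ \int_0^\infty x^{s-1} e^{-\lambda x}\,dx = \frac{\Gamma(s)}{\lambda^s}. \]

Now let $s = \frac{r+1}{2}$ and $\lambda = \frac{t^2}{2r} + 1/2$ in the preceding formula,

\[ \int_0^\infty u^{\frac{r+1}{2}-1} e^{-u\left(\frac{t^2}{2r} + 1/2\right)}\,du = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\left(\frac{t^2}{2r} + 1/2\right)^{\frac{r+1}{2}}}. \]

Thus,

\[ f_T(t) = \frac{1}{\sqrt{2\pi r}\, 2^{r/2}\, \Gamma(r/2)}\, \frac{\Gamma\left(\frac{r+1}{2}\right)}{\left(\frac{t^2}{2r} + 1/2\right)^{\frac{r+1}{2}}} = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\Gamma(r/2)\sqrt{\pi r}}\, (1 + t^2/r)^{-(r+1)/2}. \]

The computation of the Student density is complete.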
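The Student density is easy to evaluate directly. The following sketch is illustrative and not from the text: it works on the log scale with `math.lgamma` (an implementation choice to avoid overflow of the Gamma function for large $r$) and compares the result with the standard normal density.

```python
import math

def student_density(t, r):
    # Gamma((r+1)/2) / (Gamma(r/2) sqrt(pi r)) * (1 + t^2/r)^(-(r+1)/2),
    # computed through logarithms so that large r does not overflow.
    log_const = (math.lgamma((r + 1) / 2) - math.lgamma(r / 2)
                 - 0.5 * math.log(math.pi * r))
    return math.exp(log_const - (r + 1) / 2 * math.log1p(t * t / r))

def standard_normal_density(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

if __name__ == "__main__":
    for r in (1, 5, 10, 1000):
        print(r, student_density(1.0, r))
    print("normal:", standard_normal_density(1.0))
```

With $r = 1$ the formula reduces to $1/(\pi(1 + t^2))$, and as $r$ grows the printed values approach the standard normal density at the same point.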


4 The F Distribution


• Assume that X and Y are independent Chi-squared random variables with $m$ and
$n$ degrees of freedom, respectively. Then the distribution of

\[ \frac{X/m}{Y/n} \]

is called the F distribution (after R.A. Fisher) with $(m, n)$ degrees of freedom. Its
density is

\[ f(v) = \frac{\Gamma((m+n)/2)}{\Gamma(m/2)\Gamma(n/2)} \left(\frac{m}{n}\right)^{m/2} v^{m/2-1} \left(1 + \frac{m}{n} v\right)^{-(n+m)/2} \text{ for } v > 0. \]

In order to compute the density of an F distribution, we go back to Gamma
distributions. Recall that the ratio of two independent Gamma random variables
with parameters $(r, \lambda)$ and $(s, \lambda)$ has probability density

\[ \frac{\Gamma(r+s)}{\Gamma(r)\Gamma(s)}\, \frac{u^{r-1}}{(u+1)^{r+s}} \text{ for } u > 0. \]

Since a Chi-squared random variable with $m$ degrees of freedom is a Gamma
random variable with parameters $r = m/2$ and $\lambda = 1/2$, $U = X/Y$ has probability
density

\[ f_U(u) = \frac{\Gamma((m+n)/2)}{\Gamma(m/2)\Gamma(n/2)}\, \frac{u^{m/2-1}}{(u+1)^{(m+n)/2}} \text{ for } u > 0. \]

Let

\[ V = \frac{X/m}{Y/n}, \]

and then

\[ V = \frac{n}{m}\,\frac{X}{Y} = \frac{n}{m}\,U = cU, \]

where $c = n/m$. Let $F_U$ and $F_V$ be the cumulative distribution functions of U and
V. Then,

\[ F_V(v) = P(V \le v) = P(U \le v/c) = F_U(v/c). \]

By taking derivatives with respect to $v$ on both sides,

\[ f_V(v) = \frac{1}{c} f_U(v/c) = \frac{m}{n} f_U\left(\frac{m}{n} v\right), \]

where $f_U$ and $f_V$ are the densities of U and V. Hence, for $v > 0$

\[ f_V(v) = \frac{m}{n}\, \frac{\Gamma((m+n)/2)}{\Gamma(m/2)\Gamma(n/2)}\, \frac{\left(\frac{m}{n} v\right)^{m/2-1}}{\left(\frac{m}{n} v + 1\right)^{(m+n)/2}} = \frac{\Gamma((m+n)/2)}{\Gamma(m/2)\Gamma(n/2)} \left(\frac{m}{n}\right)^{m/2} v^{m/2-1} \left(1 + \frac{m}{n} v\right)^{-(m+n)/2}. \]

This completes our computation of the Fisher probability density.
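A direct implementation of the density makes it easy to check special cases. This is a minimal sketch (the function name and the log-scale evaluation are choices made for this illustration): for $(m, n) = (2, 2)$ the formula reduces to $1/(1+v)^2$, and since $1/V$ has an F distribution with $(n, m)$ d.f., the densities must satisfy $f_{m,n}(v) = f_{n,m}(1/v)/v^2$.

```python
import math

def f_density(v, m, n):
    # Density of the F distribution with (m, n) degrees of freedom, for v > 0.
    if v <= 0:
        return 0.0
    log_const = (math.lgamma((m + n) / 2) - math.lgamma(m / 2)
                 - math.lgamma(n / 2) + (m / 2) * math.log(m / n))
    return math.exp(log_const + (m / 2 - 1) * math.log(v)
                    - (m + n) / 2 * math.log1p(m * v / n))

if __name__ == "__main__":
    print(f_density(1.0, 2, 2))  # special case 1 / (1 + v)^2 at v = 1
    v = 0.7
    print(f_density(v, 5, 2), f_density(1 / v, 2, 5) / v**2)
```

The last line prints the same number twice, once from each side of the reciprocal identity.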


Problems

5. Sketch on the same graph the probability density of a standard normal, of a
Student with 1 d.f., a Student with 5 d.f., and a Student with 10 d.f. (Use that
$\Gamma(1/2) = \sqrt{\pi}$ and that $\Gamma(r) = (r-1)\Gamma(r-1)$.)

6. Sketch on the same graph the probability density of three Fisher distributions
with $(m, n) = (1, 1)$, $(m, n) = (2, 5)$, and $(m, n) = (5, 2)$.

7. Find the expected value and the variance of a Chi-squared random variable with
$n$ degrees of freedom.

8. Assume that X and Y are independent Chi-squared random variables with d.f.
$k$ and $n$, respectively. What is the distribution of X + Y? With what parameters?

9. Let X and Y be independent random variables. Let T = X + Y. Assume that X
and T are Chi-squared random variables with d.f. $k$ and $n$, respectively. Show
that Y is also a Chi-squared random variable. With how many d.f.?

10. Let T be a Student r.v. with $n$ d.f.

(a) Show that if $n > 1$, then the improper integral defining E(T) exists.
(b) Show that if $n > 1$, then E(T) = 0 (no computation needed!).
(c) Show that if $n > 2$, then $Var(T) = \frac{n}{n-2}$.

11. Find the mean and variance of a Fisher distribution with $(m, n)$ d.f.


Chapter 24
Sampling from a Normal Distribution


1 The Sample Average and Variance 


Let $X_1, X_2, \dots, X_n$ be i.i.d. with a normal distribution with mean $\mu$ and variance $\sigma^2$.

Recall that the sample average $\bar{X}$ is defined by

\[ \bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n), \]

and the sample variance is

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

The sample average and the sample variance depend on the same observations
$X_1, X_2, \dots, X_n$. Therefore, in general the sample average and the sample variance
are not independent. A remarkable property of the normal distribution is the
following.

• For a normal distribution the sample average $\bar{X}$ and the sample variance $S^2$ are
independent.

We will prove this result for the standard normal distribution (i.e., $\mu = 0$ and
$\sigma = 1$). The result for a general normal distribution is an easy consequence of this
particular case, see the problems. We prove this result in two steps.

First Step We prove that $\bar{X}$ and the vector $(X_2 - \bar{X}, X_3 - \bar{X}, \dots, X_n - \bar{X})$ are
independent.
Denote the vector $(\bar{X}, X_2 - \bar{X}, X_3 - \bar{X}, \dots, X_n - \bar{X})$ by $(Y_1, Y_2, \dots, Y_n)$.


© Springer Nature Switzerland AG 2022 269 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_24 


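Consider the transformation

\[ y_1 = \bar{x}, \quad y_2 = x_2 - \bar{x}, \quad \dots, \quad y_n = x_n - \bar{x}. \]

It is not difficult to show that

\[ x_1 = y_1 - y_2 - \cdots - y_n, \quad x_2 = y_1 + y_2, \quad \dots, \quad x_n = y_1 + y_n. \]

The Jacobian of this transformation is $n$, see the problems.
Since $X_1, \dots, X_n$ are i.i.d. standard normal random variables, we get the joint
density of the vector $(X_1, \dots, X_n)$

\[ f(x_1, \dots, x_n) = \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2}(x_1^2 + \cdots + x_n^2)\right) \]

for $(x_1, \dots, x_n) \in \mathbb{R}^n$. Note that

\[ \sum_{i=1}^{n} x_i^2 = \left(y_1 - \sum_{i=2}^{n} y_i\right)^2 + \sum_{i=2}^{n} (y_1 + y_i)^2 \]
\[ = y_1^2 - 2y_1 \sum_{i=2}^{n} y_i + \left(\sum_{i=2}^{n} y_i\right)^2 + (n-1)y_1^2 + 2y_1 \sum_{i=2}^{n} y_i + \sum_{i=2}^{n} y_i^2 \]
\[ = n y_1^2 + \sum_{i=2}^{n} y_i^2 + \left(\sum_{i=2}^{n} y_i\right)^2. \]

Hence, the joint density of the vector $(Y_1, \dots, Y_n)$ is

\[ f(y_1, \dots, y_n) = \frac{n}{(2\pi)^{n/2}} \exp\left(-\frac{1}{2}\left(n y_1^2 + \sum_{i=2}^{n} y_i^2 + \left(\sum_{i=2}^{n} y_i\right)^2\right)\right). \]

Therefore,

\[ f(y_1, \dots, y_n) = h(y_1)\, g(y_2, y_3, \dots, y_n), \]

where

\[ h(y_1) = \frac{n}{(2\pi)^{n/2}} \exp\left(-\frac{n}{2} y_1^2\right), \]

and

\[ g(y_2, y_3, \dots, y_n) = \exp\left(-\frac{1}{2}\left(\sum_{i=2}^{n} y_i^2 + \left(\sum_{i=2}^{n} y_i\right)^2\right)\right), \]

for $(y_1, \dots, y_n) \in \mathbb{R}^n$. This proves that $Y_1$ and the vector $(Y_2, \dots, Y_n)$ are
independent. That is, the random variable $\bar{X}$ and the vector

\[ (X_2 - \bar{X}, X_3 - \bar{X}, \dots, X_n - \bar{X}) \]

are independent. This completes the first step.

Second Step In this step we show that $S^2$ can be written using

\[ (X_2 - \bar{X}, X_3 - \bar{X}, \dots, X_n - \bar{X}) \]

only. Since this vector is independent of $\bar{X}$, this will prove that $S^2$ and $\bar{X}$ are
independent.
Since $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$ (why?),

\[ X_1 - \bar{X} = -\sum_{i=2}^{n} (X_i - \bar{X}). \]

Therefore,

\[ (X_1 - \bar{X})^2 = \left(\sum_{i=2}^{n} (X_i - \bar{X})\right)^2. \]

Hence,

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = (X_1 - \bar{X})^2 + \sum_{i=2}^{n} (X_i - \bar{X})^2 = \left(\sum_{i=2}^{n} (X_i - \bar{X})\right)^2 + \sum_{i=2}^{n} (X_i - \bar{X})^2. \]

This shows that $S^2$ can be written using $(X_2 - \bar{X}, X_3 - \bar{X}, \dots, X_n - \bar{X})$ only.

This completes the proof that $S^2$ and $\bar{X}$ are independent.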
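The algebra in both steps can be checked on a concrete vector. The sketch below is illustrative only: the function names and the sample values are assumptions, and the code simply evaluates the change of variables, its inverse, and the two identities used in the proof.

```python
def to_y(xs):
    # (x_1, ..., x_n) -> (y_1, ..., y_n) = (xbar, x_2 - xbar, ..., x_n - xbar)
    xbar = sum(xs) / len(xs)
    return [xbar] + [x - xbar for x in xs[1:]]

def to_x(ys):
    # Inverse map: x_1 = y_1 - y_2 - ... - y_n and x_i = y_1 + y_i for i >= 2.
    y1, rest = ys[0], ys[1:]
    return [y1 - sum(rest)] + [y1 + y for y in rest]

def sum_of_squares_in_y(ys):
    # n y_1^2 + sum_{i>=2} y_i^2 + (sum_{i>=2} y_i)^2
    n, y1, rest = len(ys), ys[0], ys[1:]
    return n * y1**2 + sum(y * y for y in rest) + sum(rest) ** 2

def squared_deviations_from_rest(ys):
    # sum_i (x_i - xbar)^2 written with (x_2 - xbar, ..., x_n - xbar) only.
    rest = ys[1:]
    return sum(rest) ** 2 + sum(y * y for y in rest)

if __name__ == "__main__":
    xs = [2.0, -1.0, 0.5, 3.5]  # hypothetical observations
    ys = to_y(xs)
    print(to_x(ys))
    print(sum(x * x for x in xs), sum_of_squares_in_y(ys))
    print(squared_deviations_from_rest(ys))
```

For this sample the sum of squares is 17.5 in both coordinate systems, and the sum of squared deviations from the average is 11.25.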

Problems 


1. Show that

\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0. \]

2. Consider the transformation

\[ y_1 = \bar{x}, \quad y_2 = x_2 - \bar{x}, \quad \dots, \quad y_n = x_n - \bar{x}. \]

(a) Show that

\[ x_1 = y_1 - y_2 - \cdots - y_n, \quad x_2 = y_1 + y_2, \quad \dots, \quad x_n = y_1 + y_n. \]

(b) Show that the Jacobian of this transformation is $n$. (Add all the rows to the
first one to get a triangular matrix.)

3. We proved in this section that $S^2$ and $\bar{X}$ are independent for a standard normal
sample. In this problem we show that the result is true for any normal sample. Let
$Y_1, \dots, Y_n$ be an i.i.d. sample of a normal with mean $\mu$ and standard deviation
$\sigma$. For $i = 1, \dots, n$, let

\[ X_i = \frac{1}{\sigma}(Y_i - \mu). \]

(a) Show that $X_i$ has a standard normal distribution for $i = 1, \dots, n$.
(b) Show that $\bar{Y} = \sigma \bar{X} + \mu$.
(c) Let

\[ S_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \text{ and } S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

Show that $S_Y^2 = \sigma^2 S_X^2$.
(d) Show that $\bar{Y}$ and $S_Y^2$ are independent.


2 The Sample Average Is Normal 


Let $X_1, X_2, \dots, X_n$ be i.i.d. with a normal distribution with mean $\mu$ and variance $\sigma^2$.
The sample average $\bar{X}$ is defined by

\[ \bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n). \]

• The sample average $\bar{X}$ of a normal sample with mean $\mu$ and variance $\sigma^2$ is
normally distributed with mean $\mu$ and variance $\sigma^2/n$, where $n$ is the size of
the sample.

We are going to compute the m.g.f. of $\bar{X}$ to find its distribution. Observe first

\[ \exp(t\bar{X}) = \exp\left(\frac{1}{n} X_1 t\right) \cdots \exp\left(\frac{1}{n} X_n t\right). \]

Since $X_1, X_2, \dots, X_n$ are independent, so are $\exp(\frac{1}{n} X_1 t), \dots, \exp(\frac{1}{n} X_n t)$. Hence, by
taking expectations on both sides of the equality above we get

\[ M_{\bar{X}}(t) = E\left(\exp\left(\frac{1}{n} X_1 t\right)\right) \cdots E\left(\exp\left(\frac{1}{n} X_n t\right)\right) = M_{X_1}\left(\frac{t}{n}\right) \cdots M_{X_n}\left(\frac{t}{n}\right). \]

Now we use that $X_1, X_2, \dots, X_n$ all have the same distribution and hence the same
m.g.f. Therefore,

\[ M_{\bar{X}}(t) = \left(M_X\left(\frac{t}{n}\right)\right)^n = \left(\exp\left(\mu \frac{t}{n} + \frac{1}{2}\sigma^2 \frac{t^2}{n^2}\right)\right)^n = \exp\left(\mu t + \frac{1}{2}\,\frac{\sigma^2}{n}\, t^2\right). \]

Hence, the m.g.f. of $\bar{X}$ is the m.g.f. of a normal random variable with mean $\mu$ and
variance $\sigma^2/n$. This proves our claim about the distribution of $\bar{X}$.


3 The Sample Variance Distribution 


First we recall two facts about the Chi-squared distribution. 
• Let $Z_1, \dots, Z_n$ be i.i.d. standard normal random variables. Let

\[ X = Z_1^2 + \cdots + Z_n^2. \]

Then, X has a Chi-squared distribution with $n$ degrees of freedom (d.f. in short).
• Let X be a Chi-squared distribution with $n$ degrees of freedom. Then, the moment
generating function of X is

\[ M_X(t) = (1 - 2t)^{-n/2}. \]

We will show the following result.

• Let $S^2$ be the sample variance of an i.i.d. normal sample. Then,

\[ \frac{(n-1)S^2}{\sigma^2} \]

has a Chi-squared distribution with $n - 1$ d.f.

We start by writing $S^2$ using independent standard normal random variables. By
using only algebra, we get

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu + \mu - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 + 2\sum_{i=1}^{n} (X_i - \mu)(\mu - \bar{X}) + n(\mu - \bar{X})^2. \]

Note now that

\[ \sum_{i=1}^{n} (X_i - \mu)(\mu - \bar{X}) = (\mu - \bar{X}) \sum_{i=1}^{n} (X_i - \mu) = (\mu - \bar{X})(n\bar{X} - n\mu) = -n(\bar{X} - \mu)^2. \]

Hence,

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2. \]

We divide the preceding equality by $\sigma^2$ and write it as

\[ W + T = Y, \qquad (1) \]

where

\[ W = \frac{(n-1)S^2}{\sigma^2}, \quad T = \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2, \quad Y = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2. \]

Observe that $\frac{1}{\sigma}(X_1 - \mu), \dots, \frac{1}{\sigma}(X_n - \mu)$ are i.i.d. standard normal random variables.
Hence, Y has a Chi-squared distribution with $n$ d.f.

Since $\bar{X}$ is a normal random variable with mean $\mu$ and variance $\sigma^2/n$, $\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu)$
is a standard normal random variable. Therefore, T has a Chi-squared distribution
with 1 d.f.

Using that $S^2$ and $\bar{X}$ are independent, we get that W and T are independent.
Taking moment generating functions in (1),

\[ M_W(t) M_T(t) = M_Y(t). \]

Therefore,

\[ M_W(t)(1 - 2t)^{-1/2} = (1 - 2t)^{-n/2}. \]

Hence,

\[ M_W(t) = (1 - 2t)^{-(n-1)/2}. \]

That is, $(n-1)S^2/\sigma^2$ has a Chi-squared distribution with $n - 1$ d.f.
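A simulation gives a rough empirical check of this result. The sketch below is an illustration under assumed values ($n = 5$, $\mu = 10$, $\sigma = 2$, a fixed seed): it draws many normal samples and looks at the average and variance of $(n-1)S^2/\sigma^2$, which should be near $n - 1$ and $2(n - 1)$, the mean and variance of a Chi-squared with $n - 1$ d.f.

```python
import random

def sample_variance(xs):
    # S^2 with the n - 1 denominator.
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def simulate_scaled_sample_variance(n, mu, sigma, reps, seed=0):
    # Returns `reps` simulated values of (n - 1) S^2 / sigma^2.
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        values.append((n - 1) * sample_variance(sample) / sigma**2)
    return values

if __name__ == "__main__":
    w = simulate_scaled_sample_variance(n=5, mu=10.0, sigma=2.0, reps=20_000)
    mean_w = sum(w) / len(w)
    print(mean_w, sample_variance(w))  # expected to be near 4 and 8
```

Agreement of two moments is only a consistency check, not a proof that the distribution is Chi-squared.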


276 24 Sampling from a Normal Distribution 


4 The Standardized Average 


Recall that: 


• Let Z be a standard normal random variable and Y be a Chi-squared random
variable with $n$ d.f. Assume that Z and Y are independent. Then,

\[ \frac{Z}{\sqrt{Y/n}} \]

follows a Student distribution with $n$ d.f.

Thanks to the results of the preceding sections, we can state the following
result.

• Let $X_1, X_2, \dots, X_n$ be i.i.d. with a normal distribution with mean $\mu$ and variance
$\sigma^2$. Let $\bar{X}$ and $S^2$ be the sample average and the sample variance. Then,

\[ \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

follows a Student distribution with $n - 1$ d.f.

We now prove this result. Let

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}. \]

Since $\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$, Z is a standard
normal random variable. Let

\[ Y = \frac{n-1}{\sigma^2}\, S^2. \]

We know that Z and Y are independent and that Y follows a Chi-squared distribution
with $n - 1$ d.f. Hence,

\[ \frac{Z}{\sqrt{Y/(n-1)}} \]

follows a Student distribution with $n - 1$ d.f. Since

\[ \frac{Z}{\sqrt{Y/(n-1)}} = \frac{\bar{X} - \mu}{S/\sqrt{n}}, \]

the result follows.


Problem


4. Let $X_1, \dots, X_{n_1}$ be an i.i.d. sample with a normal distribution with mean $\mu_1$
and variance $\sigma^2$. Let $Y_1, \dots, Y_{n_2}$ be an i.i.d. sample with a normal distribution
with mean $\mu_2$ and the same variance $\sigma^2$. Assume that the two samples are
independent.

(a) Show that the random variable

\[ \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

is a standard normal.

(b) Let $S_1^2$ and $S_2^2$ be the variances of the samples $X_1, \dots, X_{n_1}$ and $Y_1, \dots, Y_{n_2}$,
respectively. Let

\[ S^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}. \]

Show that $(n_1 + n_2 - 2)S^2/\sigma^2$ has a Chi-squared distribution. With how
many degrees of freedom?
(c) Show that the random variable

\[ \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

is a Student random variable with $n_1 + n_2 - 2$ degrees of freedom.


Chapter 25
Finding Estimators


1 The Method of Moments 


Let $\mu$ be the mean of a certain distribution. We would like to estimate $\mu$. A natural
way to do that is to take a sample of the distribution and estimate $\mu$ by using
the sample average. More formally, let $X_1, X_2, \dots, X_n$ be an i.i.d. sample of the
distribution of interest that has mean $\mu$. Then, an estimator of $\mu$ is

\[ \hat{\mu} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n). \]

Note that $\hat{\mu}$ and $\bar{X}$ denote the same estimator.

Is $\hat{\mu}$ the best way to estimate $\mu$? What does "best" mean? What if I am not
interested in estimating the mean but something else? In this and following chapters
we will be concerned with such questions. We start by defining estimators.

• A parameter such as $\mu$ will never be known. We use a statistic such as the sample
average to estimate $\mu$. This statistic (or estimator) is denoted by $\hat{\mu}$. Note that $\hat{\mu}$ is
random and will change from sample to sample. The parameter $\mu$ is not random
and is not to be confused with $\hat{\mu}$.

We now introduce the method of moments.

• Let $k \ge 1$. The k-th moment of the random variable X is defined by

\[ \mu_k = E(X^k). \]

Example 1 Consider the exponential distribution with parameter $\theta$. It has probability
density $f(x) = \theta \exp(-\theta x)$. Its first moment $\mu_1$ is $1/\theta$. Hence,

\[ \theta = \frac{1}{\mu_1}. \]


© Springer Nature Switzerland AG 2022 279 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_25 


The method of moments estimates $\mu_1$ by using $\bar{X}$. This yields

\[ \hat{\theta} = \frac{1}{\hat{\mu}_1} = \frac{1}{\bar{X}}. \]

Another way to write the same thing is to say that given observations $x_1, x_2, \dots, x_n$
of a sample,

\[ \hat{\theta}(x_1, \dots, x_n) = \frac{1}{\bar{x}}, \]

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. This emphasizes that $\hat{\theta}$ depends on the sample alone. An
estimator of $\theta$ cannot depend on $\theta$ of course! The estimator $\hat{\theta}$ is called the method
of moments estimator (or m.m.e.) for $\theta$.

For instance, assume that we have the following 20 observations $X_1, \dots, X_{20}$:
2.9983, 0.2628, 0.9335, 0.6655, 2.2192, 1.4359, 0.6097, 0.0187, 1.7226, 0.5883,
0.9556, 1.5699, 2.5487, 1.3402, 0.1939, 0.5204, 2.7406, 2.4878, 0.5281, 2.2410.
The sample average is $\bar{X} = 1.329$. Therefore, the m.m.e. for $\theta$ is

\[ \hat{\theta} = \frac{1}{\bar{X}} = 0.7524. \]
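The arithmetic of this example can be reproduced in a few lines. The following sketch uses the 20 observations listed above; the function name is an illustrative choice.

```python
def exponential_mme(observations):
    # Method of moments estimate for the exponential parameter: 1 / sample average.
    sample_average = sum(observations) / len(observations)
    return 1 / sample_average

observations = [
    2.9983, 0.2628, 0.9335, 0.6655, 2.2192, 1.4359, 0.6097, 0.0187, 1.7226, 0.5883,
    0.9556, 1.5699, 2.5487, 1.3402, 0.1939, 0.5204, 2.7406, 2.4878, 0.5281, 2.2410,
]

if __name__ == "__main__":
    print(round(sum(observations) / len(observations), 3))  # sample average
    print(round(exponential_mme(observations), 4))          # m.m.e. for theta
```

The two printed values are the sample average and the estimate quoted in the example, rounded to the same number of decimals.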


Example 2 Let X be a random variable with mean 0 and variance $\sigma^2$. What is the
m.m.e. for $\sigma^2$?

We know that E(X) = 0. Hence, the first moment is not helpful in estimating
$\sigma^2$. Therefore, we compute the second moment. Since

\[ Var(X) = E(X^2) - E(X)^2 = E(X^2), \]

$E(X^2) = \sigma^2$. Let

\[ \overline{X^2} = \frac{1}{n}\sum_{i=1}^{n} X_i^2. \]

The method of moments uses $\overline{X^2}$ to estimate $E(X^2)$. Hence, the m.m.e. for $\sigma^2$ is

\[ \hat{\sigma}^2 = \overline{X^2}. \]


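Example 3 Let X be a discrete random variable with distribution

\[ P(X = k) = 1/\theta \text{ for } k \text{ in } \{1, 2, \dots, \theta\}. \]

The first moment

\[ \mu_1 = \sum_{k=1}^{\theta} k P(X = k|\theta) = \frac{1}{\theta}\sum_{k=1}^{\theta} k = \frac{1}{\theta}\,\frac{\theta(\theta+1)}{2} = \frac{1}{2}(\theta + 1), \]

where we use that

\[ 1 + 2 + \cdots + \theta = \frac{1}{2}\theta(\theta + 1). \]

Solving for $\theta$,

\[ \theta = 2\mu_1 - 1 \]

and the m.m.e. in this case is

\[ \hat{\theta} = 2\hat{\mu}_1 - 1 = 2\bar{X} - 1. \]

Assume that we have the following observations: 5, 2, 1, 3, 2, 2, 2, 6, 2, 3. The
average is 2.8 and $\hat{\theta} = 4.6$.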
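The same computation in code, as a short sketch using the ten observations above (the function name is illustrative):

```python
def discrete_uniform_mme(observations):
    # Method of moments estimate for theta when P(X = k) = 1/theta on {1, ..., theta}:
    # theta_hat = 2 * sample average - 1.
    sample_average = sum(observations) / len(observations)
    return 2 * sample_average - 1

observations = [5, 2, 1, 3, 2, 2, 2, 6, 2, 3]

if __name__ == "__main__":
    print(sum(observations) / len(observations), discrete_uniform_mme(observations))
    print(max(observations))
```

The last line prints the largest observation, which is 6. The estimate 4.6 is smaller than that, even though no observation can exceed $\theta$.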


Example 4 Consider a Poisson distribution with parameter $\lambda$. What is the m.m.e.
for $\lambda$?

The expected value of a Poisson distribution with parameter $\lambda$ is $\lambda$. That is,
$E(X) = \lambda$. Hence, the method of moments estimator for $\lambda$ is

\[ \hat{\lambda} = \bar{X}. \]


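Example 5 Let X be distributed according to a Gamma distribution with parameters
$r$ and $\lambda$. What is the m.m.e. for $(r, \lambda)$?
Recall that

\[ E(X) = \mu_1 = \frac{r}{\lambda} \text{ and } E(X^2) = \mu_2 = \frac{r(r+1)}{\lambda^2}. \]

We need to solve these two equations in $r$ and $\lambda$. We write $\mu_2$ as

\[ \mu_2 = \left(\frac{r}{\lambda}\right)^2 + \frac{1}{\lambda}\,\frac{r}{\lambda} \]

and substitute $\frac{r}{\lambda}$ by $\mu_1$ to get

\[ \mu_2 = \mu_1^2 + \frac{\mu_1}{\lambda}. \]

Hence,

\[ \lambda = \frac{\mu_1}{\mu_2 - \mu_1^2} \]

and

\[ r = \lambda\mu_1 = \frac{\mu_1^2}{\mu_2 - \mu_1^2}. \]

This yields the following method of moments estimates

\[ \hat{\lambda} = \frac{\hat{\mu}_1}{\hat{\mu}_2 - \hat{\mu}_1^2} \text{ and } \hat{r} = \frac{\hat{\mu}_1^2}{\hat{\mu}_2 - \hat{\mu}_1^2}, \]

where

\[ \hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^{n} X_i \text{ and } \hat{\mu}_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2. \]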
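The two estimates translate directly into code. This is a minimal sketch; the function name and the tiny data set in the usage lines are hypothetical and serve only to show the arithmetic.

```python
def gamma_mme(observations):
    # Method of moments estimates (r_hat, lambda_hat) for a Gamma sample.
    n = len(observations)
    m1 = sum(observations) / n                 # first sample moment
    m2 = sum(x * x for x in observations) / n  # second sample moment
    spread = m2 - m1**2                        # must be strictly positive
    return m1**2 / spread, m1 / spread

if __name__ == "__main__":
    # Hypothetical observations: m1 = 2 and m2 = 14/3, so m2 - m1^2 = 2/3.
    r_hat, lambda_hat = gamma_mme([1.0, 2.0, 3.0])
    print(r_hat, lambda_hat)
```

Note that $\hat{\mu}_2 - \hat{\mu}_1^2$ is zero when all observations are equal, in which case the estimates are undefined.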
Problems 


1. Let X be a Bernoulli random variable with parameter $p$. That is,

\[ P(X = 1|p) = p \text{ and } P(X = 0|p) = 1 - p. \]

Find the m.m.e. of $p$.

2. Let U be a continuous random variable uniformly distributed on $[-\theta, \theta]$. Find
the method of moments estimator for $\theta$.

3. Let U be a continuous random variable uniformly distributed on $[0, \theta]$.

(a) Find the method of moments estimator for $\theta$.
(b) Compute the m.m.e. given the following observations 0.1158 0.7057 1.6263
0.0197 0.2778 0.4055 0.3974 1.2076 0.5444 0.3976.

4. The Pareto distribution has density

\[ f(x|\theta) = \theta x^{-\theta-1} \text{ for } x > 1. \]

Assume that $\theta$ is an unknown parameter somewhere in $(1, \infty)$.

(a) Check that $f$ is indeed a probability density.
(b) Find the m.m.e. of $\theta$.

5. Consider the continuous uniform density on $[\theta - 1/2, \theta + 1/2]$.
Find the m.m.e. of $\theta$.
6. Consider a geometric distribution with parameter $p$. That is,

\[ P(X = x|p) = (1 - p)^{x-1} p \text{ for } x = 1, 2, \dots \]

Find the m.m.e. of $p$.
7. Consider the Laplace distribution with density 


1 
PAE) eee a); 


Find the m.m.e. of 6. 
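8. Consider the binomial distribution

\[ P(X = x|k, p) = \binom{k}{x} p^x (1 - p)^{k-x}, \]

where $k$ and $p$ are unknown.

(a) Let $(\hat{k}, \hat{p})$ be the method of moments estimator for $(k, p)$. Show that

\[ \hat{k}\hat{p} = \bar{x}, \]
\[ \hat{k}\hat{p}(1 - \hat{p}) + \hat{k}^2\hat{p}^2 = \overline{x^2}, \]

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2$.
(b) Show that

\[ \hat{k} = \frac{\bar{x}^2}{\bar{x} - \overline{x^2} + \bar{x}^2}. \]

(c) Compute $(\hat{k}, \hat{p})$ for the following sample, 1 3 0 2 3 1 1 2 2 1 1 2 2 2 4 0 1 2
0 1.

9. Consider an i.i.d. sample of a Gamma distribution with parameters $r$ and $\lambda$.
Assume that $r$ is known.
Find the method of moments estimator for $\lambda$.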


2 The Maximum Likelihood Method 


We now introduce a second method to find estimators. 


Example 6 I have two seemingly identical coins. One coin is biased with a
probability of tails 3/4, and the other coin is honest with a probability of tails 1/2. I
pick one of the coins at random and toss it three times. I get two tails and one head.
Is this the biased coin? I cannot answer this question with certainty, but I can decide
which coin is more likely given the results of my experiment. If this is the biased
coin, the probability of getting two tails and one head is (the number of tails in three
tosses is a binomial random variable)

\[ 3(3/4)^2(1/4) = \frac{27}{64} \approx 0.42. \]

On the other hand if this is the honest coin, then the probability of getting two tails
and one head is

\[ 3(1/2)^2(1/2) = 3/8 \approx 0.38. \]

Hence, based on our sample it is more likely that we picked the biased coin than the
honest coin.

More formally, let $p$ be the probability of tails of the coin we picked. We know
that $p$ can be either 3/4 or 1/2. We toss the coin 3 times and record each time whether
we have heads or tails. This yields an i.i.d. sample of size 3. In this particular sample
we get two tails and one head. The probability (or likelihood) of such a result is

\[ 3p^2(1 - p). \]

The parameter $p$ is known to have only two possible values. The likelihood turns
out to be maximum at $p = 3/4$. Hence, the maximum likelihood estimator for $p$
is

\[ \hat{p} = 3/4. \]
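The comparison of the two candidate values of $p$ can be written out exactly with rational arithmetic. This is a small illustrative sketch; the function name is an assumption.

```python
from fractions import Fraction

def likelihood_two_tails_one_head(p):
    # Probability of exactly two tails in three tosses when P(tails) = p.
    return 3 * p**2 * (1 - p)

candidates = [Fraction(3, 4), Fraction(1, 2)]

if __name__ == "__main__":
    for p in candidates:
        value = likelihood_two_tails_one_head(p)
        print(p, value, float(value))
    print("maximum at p =", max(candidates, key=likelihood_two_tails_one_head))
```

Because only two values of $p$ are possible here, maximizing the likelihood amounts to comparing two numbers.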


Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample of a given distribution with parameter $\theta$.
The likelihood function L is defined to be

• $L(\theta|x_1, \dots, x_n) = P(X_1 = x_1|\theta)P(X_2 = x_2|\theta) \cdots P(X_n = x_n|\theta)$, if the
distribution is discrete.

• $L(\theta|x_1, \dots, x_n) = f(x_1|\theta) \cdots f(x_n|\theta)$, if the distribution is continuous with
density $f$.

• Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample of a given distribution with parameter
$\theta$. Let L be the likelihood function of this sample. Given the observations
$x_1, x_2, \dots, x_n$, the maximum likelihood estimator (or m.l.e.) of $\theta$ is the value
$\hat{\theta}$ that maximizes L. In other words, for all possible $\theta$ we must have

\[ L(\theta|x_1, \dots, x_n) \le L(\hat{\theta}|x_1, \dots, x_n). \]

A word on notation. We use the notation $f(x|\theta)$ and $P(X = x|\theta)$ to
emphasize that $\theta$ determines these functions of $x$. Similarly, the likelihood
function $L(\theta|x_1, \dots, x_n)$ is a function of $\theta$ determined by the sample observations
$(x_1, \dots, x_n)$.


Example 7 Consider the exponential distribution with parameter $\theta$. In Example 1
we showed that the m.m.e. of $\theta$ is

$$\hat\theta_m = \frac{1}{\bar X}.$$

We now find the m.l.e. of $\theta$. Assume that we have an i.i.d. sample $X_1, X_2, \dots, X_n$.
The likelihood function is

$$\begin{aligned}
L(\theta|x_1,\dots,x_n) &= f(x_1|\theta)\cdots f(x_n|\theta) \\
&= \theta\exp(-\theta x_1)\cdots\theta\exp(-\theta x_n) \\
&= \theta^n \exp(-\theta(x_1+\dots+x_n))
\end{aligned}$$

provided $x_1, x_2, \dots, x_n$ are all strictly positive (the support of the exponential
distribution is $(0, +\infty)$). The function $L$ is 0 otherwise. Given $(x_1,\dots,x_n)$, we want
to find the maximum of $L$ as a function of $\theta$. Since $L$ is differentiable, if it has a
maximum, then the derivative of $L$ should be 0 at that point. Instead of looking for
a maximum for $L$, it is more convenient and equivalent to look for a maximum for
$\ln L$. We have

$$\ln L(\theta) = n\ln\theta - \theta(x_1+\dots+x_n).$$

Taking the derivative with respect to $\theta$,

$$\frac{d}{d\theta}\ln L(\theta) = \frac{n}{\theta} - (x_1+\dots+x_n).$$

Hence, this derivative is 0 if and only if $\theta = 1/\bar x$. The fact that the derivative is 0 at
$\theta = 1/\bar x$ does not necessarily imply that there is a maximum there. To determine
whether we have a maximum we know from Calculus that there are at least two tests
that we may use, the first derivative test or the second derivative test. We compute
the second derivative

$$\frac{d^2}{d\theta^2}\ln L(\theta) = -\frac{n}{\theta^2},$$

which is strictly negative for all $\theta$ and in particular for $\theta = 1/\bar x$. By the second
derivative test $\ln L$ and therefore $L$ has a maximum at $\theta = 1/\bar x$. Hence, we have
found the m.l.e. of $\theta$, and it is

$$\hat\theta_l = \frac{1}{\bar X}.$$

The m.m.e. and the m.l.e. coincide for the exponential distribution.
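The closed form $1/\bar x$ can be checked against a brute-force maximization of $\ln L$. The following is only a minimal sketch: the simulated sample, the value `true_theta`, the seed, and the grid are all assumptions made for illustration and are not taken from the text.

```python
import math
import random

def exp_log_likelihood(theta, xs):
    """ln L(theta) = n ln(theta) - theta * (x_1 + ... + x_n)."""
    return len(xs) * math.log(theta) - theta * sum(xs)

random.seed(0)            # fixed seed so the run is reproducible
true_theta = 2.0          # assumed value, used only to simulate data
xs = [random.expovariate(true_theta) for _ in range(200)]

theta_hat = len(xs) / sum(xs)   # closed form: 1 / sample mean

# Independent check: maximize ln L over a grid of step 0.001 on (0, 10].
grid = [k / 1000 for k in range(1, 10001)]
theta_grid = max(grid, key=lambda th: exp_log_likelihood(th, xs))

print(theta_hat, theta_grid)
```

Because $\ln L$ is concave, the grid maximizer must be one of the two grid points next to $1/\bar x$, so the two printed values should agree to within the grid step.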


Example 8 Consider the Poisson distribution with parameter $\lambda$. For $x = 0, 1, 2, \dots$,

$$P(X = x|\lambda) = \exp(-\lambda)\frac{\lambda^x}{x!}.$$

Hence, the likelihood function is

$$\begin{aligned}
L(\lambda|x_1,\dots,x_n) &= P(X_1 = x_1|\lambda)P(X_2 = x_2|\lambda)\cdots P(X_n = x_n|\lambda) \\
&= \exp(-n\lambda)\frac{\lambda^s}{x_1!\cdots x_n!},
\end{aligned}$$

where $s = x_1 + x_2 + \dots + x_n$. Therefore,

$$\ln L(\lambda) = -n\lambda + s\ln\lambda - \ln(x_1!\cdots x_n!).$$

This function is differentiable for all $\lambda > 0$. Taking the derivative with respect to $\lambda$,

$$\frac{d}{d\lambda}\ln L(\lambda) = -n + \frac{s}{\lambda}.$$

Hence, this derivative is 0 only for $\lambda = s/n = \bar x$. It is easy to check that the second
derivative is always strictly negative and therefore the m.l.e. for $\lambda$ is $\hat\lambda_l = \bar X$. Once
again the m.m.e. and m.l.e. coincide.

The next example shows that the m.l.e. and m.m.e. need not coincide.


Example 9 Let $X$ be the discrete random variable introduced in Example 3. We
have shown that the m.m.e. is $\hat\theta_m = 2\bar X - 1$. We now find the m.l.e. for $\theta$. Let
$X_1, X_2, \dots, X_n$ be an i.i.d. sample of the distribution with the corresponding
observations $x_1, x_2, \dots, x_n$. The $x_i$ are natural numbers in $\{1, 2, \dots, \theta\}$. By definition the
likelihood function $L$ is

$$L(\theta|x_1,\dots,x_n) = P(X_1 = x_1|\theta)P(X_2 = x_2|\theta)\cdots P(X_n = x_n|\theta) = \frac{1}{\theta^n}.$$

Given $x_1, x_2, \dots, x_n$, we need to decide whether the function $L$ (as a function of $\theta$
only) has a maximum and, if so, where. We have

$$L(\theta) = \frac{1}{\theta^n}, \quad \text{if } \theta \ge x_1, \theta \ge x_2, \dots, \theta \ge x_n.$$

If the condition $\theta \ge x_1, \theta \ge x_2, \dots, \theta \ge x_n$ fails, it means that one of the
observations is strictly larger than $\theta$. That cannot happen, and therefore if this
condition fails, we have $L(\theta) = 0$. It is easy to see that the condition
$\theta \ge x_1, \theta \ge x_2, \dots, \theta \ge x_n$ is actually equivalent to $\theta \ge \max(x_1,\dots,x_n)$. Let
$x_{(n)} = \max(x_1,\dots,x_n)$. In summary,

$$L(\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } \theta \ge x_{(n)} \\ 0 & \text{for } \theta < x_{(n)} \end{cases}$$

Note that $L$ is a decreasing function of $\theta$ for $\theta \ge x_{(n)}$. Hence, the maximum of
$L$ occurs at $\theta = x_{(n)}$. Therefore, the maximum likelihood estimator for $\theta$ is

$$\hat\theta_l = X_{(n)} = \max(X_1,\dots,X_n).$$

Observe that it is quite different from the m.m.e. Taking the numerical values from
Example 3, we get $\hat\theta_l = 6$, while the m.m.e. was shown to be 4.6.
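The piecewise likelihood above is easy to evaluate directly. The sketch below is illustrative only: the observations `xs` are hypothetical, chosen so that the two estimates match the values 6 and 4.6 quoted in the text, and they are not necessarily the data of Example 3. The search range 1 to 50 is also an assumption.

```python
from fractions import Fraction

def likelihood(theta, xs):
    """L(theta) = 1/theta^n if theta >= max(xs), else 0 (uniform on {1,...,theta})."""
    if theta < max(xs):
        return Fraction(0)
    return Fraction(1, theta ** len(xs))

# Hypothetical observations (see the note above).
xs = [1, 2, 6, 3, 2]

# m.l.e.: maximize L over candidate values theta = 1, ..., 50.
mle = max(range(1, 51), key=lambda th: likelihood(th, xs))

# m.m.e.: 2 * (sample mean) - 1.
mme = 2 * Fraction(sum(xs), len(xs)) - 1

print(mle, float(mme))
```

The likelihood is 0 below the largest observation and strictly decreasing from there on, so the search returns the sample maximum.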


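Example 10 Consider an i.i.d. sample of a normal distribution with mean $\mu$ and
variance $\sigma^2$. We assume that both $\mu$ and $\sigma$ are unknown parameters. Recall that the
density is

$$f(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Hence, the likelihood function is

$$L(\mu,\sigma|x_1,\dots,x_n) = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\left((x_1-\mu)^2+\dots+(x_n-\mu)^2\right)\right).$$

Thus,

$$\ln L(\mu,\sigma) = -\frac{1}{2\sigma^2}\left((x_1-\mu)^2+\dots+(x_n-\mu)^2\right) - n\ln\sqrt{2\pi} - n\ln\sigma.$$

Therefore,

$$\ln L(\mu,\sigma) = -\frac{n}{2\sigma^2}\mu^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 - n\ln\sqrt{2\pi} - n\ln\sigma.$$

Observe that as a function of $\mu$ with $\sigma$ fixed, $\ln L$ is a quadratic function. The leading
coefficient is negative, and therefore, this is a downward parabola with a maximum at the
critical point. The critical point is found by solving the equation $\frac{\partial}{\partial\mu}\ln L = 0$. This
yields

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i.$$

We now find the value of $\sigma$ that maximizes $L(\hat\mu,\sigma)$. Taking the derivative with
respect to $\sigma$,

$$\frac{d}{d\sigma}\ln L(\hat\mu,\sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i-\hat\mu)^2.$$

We find that $\frac{d}{d\sigma}\ln L(\hat\mu,\sigma) \ge 0$ if and only if

$$\sigma^2 \le \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2.$$

By the first derivative test the maximum of $L(\hat\mu,\sigma)$ occurs at

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\hat\mu)^2.$$

In summary, $\ln L$ is maximum for $(\hat\mu,\hat\sigma)$. This is the m.l.e. of $(\mu,\sigma)$.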
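The two formulas of Example 10 can be sketched numerically as follows. The observations `xs` and the perturbation size 0.05 are hypothetical, chosen only for illustration; the comparison with `statistics.pstdev` relies on the fact that the population standard deviation divides by $n$, like $\hat\sigma$.

```python
import math
import statistics

def normal_log_likelihood(mu, sigma, xs):
    """ln L(mu, sigma) for an i.i.d. normal sample xs."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -ss / (2 * sigma ** 2) - n * math.log(math.sqrt(2 * math.pi)) - n * math.log(sigma)

xs = [4.9, 5.6, 4.2, 6.1, 5.0, 5.3]   # hypothetical observations

mu_hat = sum(xs) / len(xs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / len(xs))

best = normal_log_likelihood(mu_hat, sigma_hat, xs)
# Moving away from (mu_hat, sigma_hat) should not increase ln L.
nearby = [normal_log_likelihood(mu_hat + dm, sigma_hat + ds, xs)
          for dm in (-0.05, 0.0, 0.05) for ds in (-0.05, 0.0, 0.05)]

print(mu_hat, sigma_hat, statistics.pstdev(xs), best >= max(nearby))
```

Note that $\hat\sigma^2$ divides by $n$, not by $n-1$, so it differs from the sample variance $S^2$.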


Problems

10. Let $X$ be a Bernoulli random variable with parameter $p$.
(a) Show that the distribution of $X$ can be written as
$P(X = x|p) = p^x(1-p)^{1-x}$ for $x = 0$ or 1.
(b) Use (a) to find the m.l.e. of $p$.

11. The Pareto distribution has density
$f(x|\theta) = \theta x^{-\theta-1}$ for $x > 1$.
Assume that $\theta$ is an unknown parameter somewhere in $(1, \infty)$. Find the m.l.e.
of $\theta$.

12. Consider the uniform distribution in $[0, \theta]$.
(a) Show that

$$L(\theta|x_1,\dots,x_n) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } \theta \ge x_{(n)} \\ 0 & \text{for } \theta < x_{(n)} \end{cases}$$

where $x_{(n)}$ is the maximum of all observations.
(b) Graph $L$ as a function of $\theta$.
(c) Use (b) to find the m.l.e. of $\theta$.

13. Consider the continuous uniform density on $[\theta - 1/2, \theta + 1/2]$.
(a) Given observations $x_1, \dots, x_n$, show that the likelihood function is
$L(\theta) = 1$ if $\max(x_1,\dots,x_n) - 1/2 \le \theta \le \min(x_1,\dots,x_n) + 1/2$
and 0 otherwise.
(b) Show that $\hat\theta_l = \frac{1}{2}(\min(x_1,\dots,x_n) + \max(x_1,\dots,x_n))$ is an m.l.e.
(c) Show that there are infinitely many m.l.e.'s of $\theta$ in this case.

14. Consider a geometric distribution with parameter $p$. Find the m.l.e. of $p$.

15. Let $x_1, x_2, \dots, x_n$ be $n$ real numbers. Let $g$ be the function

$$g(\theta) = \sum_{i=1}^n (x_i - \theta)^2.$$

Show that $g$ attains its minimum at

$$\theta = \bar x = \frac{1}{n}\sum_{i=1}^n x_i.$$

16. Consider a normal distribution with mean $\theta$ and variance 1. Find the m.l.e. for
$\theta$.

17. Consider the Laplace distribution with density

$$f(x|\theta) = \frac{1}{2}\exp(-|x-\theta|).$$

(a) For an odd sample size, find the m.l.e. of $\theta$ (use that $g(\theta) = \sum_{i=1}^n |x_i - \theta|$
attains its minimum for $\theta$ equal to the median of the sample).
(b) Use the following observations to compute the m.l.e. of $\theta$: 4.1277 −0.2371
3.4503 2.7242 4.0894 3.0056 3.5429 2.8466 2.5044 2.0306 −0.1741

18. Let $f$ be a strictly positive function. Show that $f$ is maximum at $a$ if and only
if $\ln f$ is maximum at $a$.

19. Consider an i.i.d. sample of a Gamma distribution with parameters $r$ and $\lambda$.
Assume that $r$ is known.
Find the maximum likelihood estimator for $\lambda$.

20. Let $x_1, x_2, \dots, x_{2n+1}$ be $2n+1$ real numbers. Define the function $f$ by

$$f(\theta) = \sum_{i=1}^{2n+1} |x_i - \theta|.$$

Let $m$ be the median of the sample $x_1, x_2, \dots, x_{2n+1}$. That is, there are as many
observations below $m$ as there are above $m$. The median is unique because our
sample size is odd. We will show in this exercise that $f$ is minimum for $\theta = m$.
Let $a > 0$.

(a) Show that

$$|x_i - (m+a)| = \begin{cases} |x_i - m| + a & \text{for } x_i \le m \\ -|x_i - m| + a & \text{for } m < x_i < m+a \\ |x_i - m| - a & \text{for } x_i \ge m+a \end{cases}$$

(b) Let $A$ be the set of indices $i$ such that $x_i \le m$, $B$ be the set of indices $i$ such
that $m < x_i < m+a$, and $C$ be the set of indices $i$ such that $x_i \ge m+a$.
Show that

$$f(m+a) = \sum_{i\in A}(|x_i - m| + a) + \sum_{i\in B}(-|x_i - m| + a) + \sum_{i\in C}(|x_i - m| - a).$$

(c) Use (b) to get

$$f(m+a) = f(m) + a(|A| + |B| - |C|) - 2\sum_{i\in B}|x_i - m|,$$

where $|A|$ is the number of elements in $A$.
(d) Using that if $i$ is in $B$, then $|x_i - m| < a$, show that
$f(m+a) \ge f(m) + a(|A| - |B| - |C|)$.
(e) Show that $|A| \ge |B| + |C|$ and conclude that

$$f(m+a) \ge f(m).$$

(f) Redo the steps above to get $f(m-a) \ge f(m)$ and conclude that $f$ is
minimum at $m$.


Chapter 26
Comparing Estimators


1 The Mean Squared Error


In order to compare two estimators of the parameter $\theta$ we will measure how close
each estimator is to the parameter we want to estimate. If estimator 1 is closer to
the parameter than estimator 2, then estimator 1 is said to be better than estimator
2. Hence, "better" depends heavily on what the measure of closeness is. There are
many possible choices to measure closeness but by far the most used is the mean
squared error (or quadratic mean distance) that we now define.

• Assume that $\hat\theta$ is an estimator for $\theta$. The mean squared error between $\hat\theta$ and $\theta$
is

$$d(\hat\theta, \theta) = E[(\hat\theta - \theta)^2].$$

Note that $\hat\theta$ is a random variable (it will change from sample to sample) so
$(\hat\theta - \theta)^2$ is also a random variable. To measure how close $\hat\theta$ is to $\theta$, we would like a
non-random number. This is why we take the expectation of $(\hat\theta - \theta)^2$ to define
$d(\hat\theta, \theta)$. Another natural choice for $d(\hat\theta, \theta)$ is $E(|\hat\theta - \theta|)$. The problem with this
choice is that it is a lot less convenient mathematically. This is why we will stick
to the mean squared error.

We now recall two formulas that will often be useful to compute the mean
squared error (or M.S.E.). Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample of a distribution
with mean $\mu$ and variance $\sigma^2$. Then, the sample average is defined by

$$\bar X = \frac{1}{n}(X_1 + \dots + X_n).$$


© Springer Nature Switzerland AG 2022 291 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_26 


• The expected value and the variance of $\bar X$ are

$$E(\bar X) = \mu \quad\text{and}\quad Var(\bar X) = \frac{\sigma^2}{n}.$$

Example 1 Let $\mu$ and $\sigma^2$ be the mean and variance of a given distribution. Assume
we use $\bar X$ to estimate $\mu$. Then,

$$\begin{aligned}
d(\bar X, \mu) &= E((\bar X - \mu)^2) \\
&= Var(\bar X) \\
&= \frac{\sigma^2}{n}.
\end{aligned}$$

If we use $X_1$ (the first observation) instead to estimate $\mu$, then

$$\begin{aligned}
d(X_1, \mu) &= E((X_1 - \mu)^2) \\
&= Var(X_1) \\
&= \sigma^2.
\end{aligned}$$

Since $d(\bar X, \mu) < d(X_1, \mu)$ for all $n \ge 2$, $\bar X$ is a better estimator of $\mu$ than $X_1$
is. This is not surprising given that $\bar X$ uses a lot more information from the sample
than $X_1$ does.
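The comparison in Example 1 can be reproduced exactly on a small discrete distribution by listing every possible sample. This is only an illustrative sketch: the fair six-sided die and the sample size $n = 3$ are assumptions, not part of the example.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)   # hypothetical distribution: a fair six-sided die
n = 3                 # hypothetical sample size

mu = Fraction(sum(faces), 6)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in faces) / 6

samples = list(product(faces, repeat=n))   # all 6^n equally likely samples

def mse(estimator):
    """Exact E[(estimator(sample) - mu)^2], computed by enumeration."""
    return sum((estimator(s) - mu) ** 2 for s in samples) / len(samples)

mse_mean = mse(lambda s: Fraction(sum(s), n))   # estimator: sample average
mse_first = mse(lambda s: Fraction(s[0]))       # estimator: first observation

print(mse_mean, sigma2 / n)
print(mse_first, sigma2)
```

Exact rational arithmetic is used so that the two mean squared errors can be compared with $\sigma^2/n$ and $\sigma^2$ without rounding error.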


2 Biased and Unbiased Estimators


• Let $\hat\theta$ be an estimator of the parameter $\theta$. Then, $\hat\theta$ is said to be an unbiased
estimator of $\theta$ if $E(\hat\theta) = \theta$.

For instance, for an i.i.d. sample of a distribution with mean $\mu$ we know that
$E(\bar X) = \mu$. Therefore, the sample average $\bar X$ is an unbiased estimator of $\mu$.
The following formula is very useful in computing a M.S.E.:

• Let $\hat\theta$ be an estimator of the parameter $\theta$. Then, the mean squared error can be
written as

$$E((\hat\theta - \theta)^2) = Var(\hat\theta) + \left(E(\hat\theta) - \theta\right)^2.$$

We will prove the formula in the last section of this chapter.

This formula has a nice conceptual value. We can think of $E(\hat\theta) - \theta$ as the
bias of the estimator. Then, the formula states that a M.S.E. has two components:
the variance of the estimator and the square of the bias of the estimator. As we see below, the
formula becomes particularly simple when we have an unbiased estimator:

• If $\hat\theta$ is an unbiased estimator of the parameter $\theta$, then

$$E((\hat\theta - \theta)^2) = Var(\hat\theta).$$


3 Two Estimators for a Normal Variance


Let $X_1, \dots, X_n$ be an i.i.d. sample of a distribution with variance $\sigma^2$. We know that
the estimator

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$$

is an unbiased estimator of $\sigma^2$. That is, $E(S^2) = \sigma^2$. We now introduce a closely
related estimator for $\sigma^2$. Let

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$

We are going to compare $S^2$ and $\hat\sigma^2$ in the particular case of a normal distribution
with mean $\mu$ and variance $\sigma^2$. Recall that for a normal sample $(n-1)S^2/\sigma^2$ has a
chi-squared distribution with $n-1$ degrees of freedom. Using this fact, it is easy to
show (see the problems) that

$$Var(S^2) = \frac{2}{n-1}\sigma^4.$$

Since $S^2$ is an unbiased estimator of $\sigma^2$, the corresponding mean squared error is

$$d(S^2, \sigma^2) = Var(S^2) = \frac{2}{n-1}\sigma^4.$$

We now turn to the mean squared error of $\hat\sigma^2$. Recall that if $a$ is a constant and $X$ a
random variable, then $Var(aX) = a^2 Var(X)$. Using that $\hat\sigma^2 = \frac{n-1}{n}S^2$,

$$Var(\hat\sigma^2) = \left(\frac{n-1}{n}\right)^2 Var(S^2) = \frac{2(n-1)}{n^2}\sigma^4.$$

We now compute the bias of $\hat\sigma^2$. Since

$$E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2,$$

$$\left(E(\hat\sigma^2) - \sigma^2\right)^2 = \left(\frac{n-1}{n}\sigma^2 - \sigma^2\right)^2 = \sigma^4\left(\frac{n-1}{n} - 1\right)^2 = \frac{\sigma^4}{n^2}.$$

Therefore, the mean squared error of $\hat\sigma^2$ is

$$d(\hat\sigma^2, \sigma^2) = Var(\hat\sigma^2) + \left(E(\hat\sigma^2) - \sigma^2\right)^2 = \frac{2n-1}{n^2}\sigma^4.$$

It is easy to see that for all $n \ge 2$,

$$d(\hat\sigma^2, \sigma^2) < d(S^2, \sigma^2).$$

That is, according to the mean squared error $\hat\sigma^2$ is a better estimator than $S^2$ even
though $\hat\sigma^2$ is biased and $S^2$ is unbiased.
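The inequality between the two mean squared errors can be checked with exact fractions. In this sketch both quantities are expressed in units of $\sigma^4$; the function names are hypothetical and the range of $n$ tested is arbitrary.

```python
from fractions import Fraction

def mse_s2(n):
    """MSE of S^2 in units of sigma^4: 2 / (n - 1)."""
    return Fraction(2, n - 1)

def mse_sigma_hat2(n):
    """MSE of the biased estimator in units of sigma^4: variance + squared bias."""
    variance = Fraction(2 * (n - 1), n * n)
    squared_bias = Fraction(1, n * n)
    return variance + squared_bias

for n in (2, 5, 10, 100):
    print(n, mse_sigma_hat2(n), mse_s2(n), mse_sigma_hat2(n) < mse_s2(n))
```

The two values get closer as $n$ grows, since both behave like $2\sigma^4/n$ for large $n$.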


4 Two Estimators for a Uniform Distribution


Consider an i.i.d. sample of a uniform distribution on $(0, \theta)$. The method of
moments estimator of $\theta$ is

$$\hat\theta_m = 2\bar X.$$

Since $E(\bar X) = \mu$ and the expected value of a uniform distribution on $(0, \theta)$ is
$\mu = \theta/2$,

$$E(\hat\theta_m) = \theta.$$

That is, $\hat\theta_m$ is an unbiased estimator of $\theta$.
Hence, the mean squared error for $\hat\theta_m$ is

$$\begin{aligned}
d(\hat\theta_m, \theta) &= Var(\hat\theta_m) \\
&= Var(2\bar X) \\
&= 4Var(\bar X) \\
&= 4\frac{\sigma^2}{n},
\end{aligned}$$

where $\sigma^2$ is the variance of a uniform on $(0, \theta)$. Since $\sigma^2 = \theta^2/12$, the mean
squared error for $\hat\theta_m$ is

$$d(\hat\theta_m, \theta) = \frac{\theta^2}{3n}.$$

We have shown in the previous chapter that the maximum likelihood estimator
for $\theta$ is

$$\hat\theta_l = \max(X_1, X_2, \dots, X_n) = X_{(n)}.$$

We are going to compute the mean squared error of $\hat\theta_l$ in order to compare it to
$\hat\theta_m$. We first compute the distribution function $F_{(n)}$ of $X_{(n)}$. For $x$ in $(0, \theta)$, the
distribution function $F_{(n)}$ of $X_{(n)}$ is defined by

$$F_{(n)}(x) = P(X_{(n)} \le x).$$

Using that the largest observation is at most $x$ if and only if every observation is
at most $x$,

$$F_{(n)}(x) = P(X_1 \le x, \dots, X_n \le x).$$

Since the random variables $X_1, \dots, X_n$ are independent,

$$F_{(n)}(x) = P(X_1 \le x)\cdots P(X_n \le x) = F(x)^n,$$

where $F$ is the distribution function of a uniform distribution on $(0, \theta)$. By definition
of $F$, for $x$ in $(0, \theta)$,

$$F(x) = P(X_1 \le x) = \int_0^x \frac{1}{\theta}\,dt = \frac{x}{\theta}.$$

Hence, for $x$ in $(0, \theta)$,

$$F_{(n)}(x) = \frac{x^n}{\theta^n}.$$

By taking the derivative (with respect to $x$) of $F_{(n)}$, we get the density function $f_{(n)}$
of $X_{(n)}$. For $x$ in $(0, \theta)$,

$$f_{(n)}(x) = \frac{n}{\theta^n}x^{n-1}.$$

We now use this density function to compute the mean squared error of $X_{(n)}$.
Starting with the expected value,

$$E(X_{(n)}) = \int_0^\theta x\,\frac{n}{\theta^n}x^{n-1}\,dx = \frac{n}{n+1}\theta.$$

For the second moment,

$$E(X_{(n)}^2) = \int_0^\theta x^2\,\frac{n}{\theta^n}x^{n-1}\,dx = \frac{n}{n+2}\theta^2.$$

Hence, the variance of $X_{(n)}$ is

$$Var(X_{(n)}) = \frac{n}{n+2}\theta^2 - \left(\frac{n}{n+1}\theta\right)^2 = \frac{n}{(n+2)(n+1)^2}\theta^2.$$

The squared bias of $\hat\theta_l$ is given by

$$\left(E(X_{(n)}) - \theta\right)^2 = \left(\frac{n}{n+1}\theta - \theta\right)^2 = \frac{\theta^2}{(n+1)^2}.$$

Hence,

$$\begin{aligned}
d(\hat\theta_l, \theta) &= Var(X_{(n)}) + \left(E(X_{(n)}) - \theta\right)^2 \\
&= \frac{n}{(n+2)(n+1)^2}\theta^2 + \frac{1}{(n+1)^2}\theta^2 \\
&= \frac{2}{(n+2)(n+1)}\theta^2.
\end{aligned}$$

It is easy to check that for all $n \ge 3$,

$$\frac{2}{(n+2)(n+1)} < \frac{1}{3n}.$$

That is, for $n \ge 3$ and every $\theta$,

$$d(\hat\theta_l, \theta) < d(\hat\theta_m, \theta).$$

Therefore, according to the mean square criterion $\hat\theta_l$ is a better estimator than $\hat\theta_m$.
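The claim that the inequality holds from $n = 3$ onward can be checked with exact fractions. In this sketch both mean squared errors are expressed in units of $\theta^2$; the function names are hypothetical.

```python
from fractions import Fraction

def mse_mle(n):
    """MSE of the sample maximum in units of theta^2: 2 / ((n+2)(n+1))."""
    return Fraction(2, (n + 2) * (n + 1))

def mse_mme(n):
    """MSE of 2 * (sample mean) in units of theta^2: 1 / (3n)."""
    return Fraction(1, 3 * n)

for n in range(1, 8):
    print(n, mse_mle(n), mse_mme(n), mse_mle(n) < mse_mme(n))
```

For $n = 1$ and $n = 2$ the two quantities are equal, which is why the strict inequality is stated only for $n \ge 3$. For large $n$ the m.l.e. has a mean squared error of order $1/n^2$ while the m.m.e. has one of order $1/n$.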


5 Proof of the M.S.E. Formula


We now prove that

$$E[(\hat\theta - \theta)^2] = Var(\hat\theta) + \left(E(\hat\theta) - \theta\right)^2.$$

In order to prove this formula it is important to remember that the expectation is
linear. That is, $E(aX) = aE(X)$ and $E(X + Y) = E(X) + E(Y)$, where $X$ and $Y$
are random variables and $a$ is a real number. It is also important to realize that $\hat\theta$ is
a random variable, while $E(\hat\theta)$ and $\theta$ are numbers. Moreover, the expected value of
a number is the number itself. Let $t = E(\hat\theta)$. A little algebra shows

$$\begin{aligned}
(\hat\theta - \theta)^2 &= (\hat\theta - t + t - \theta)^2 \\
&= (\hat\theta - t)^2 + 2(\hat\theta - t)(t - \theta) + (t - \theta)^2.
\end{aligned}$$

Note that

$$E((\hat\theta - t)^2) = Var(\hat\theta)$$

and since $(t - \theta)^2$ is a number it is equal to its expectation. Hence, the formula is
proved, provided the expectation of $(\hat\theta - t)(t - \theta)$ is 0. We now show that. The
product is

$$(\hat\theta - t)(t - \theta) = \hat\theta t - \hat\theta\theta - t^2 + t\theta.$$

Take expectations across the equality above to get

$$E((\hat\theta - t)(t - \theta)) = E(\hat\theta t) - E(\hat\theta\theta) - t^2 + t\theta.$$

By linearity of the expectation, $E(\hat\theta t) = t^2$ and $E(\hat\theta\theta) = \theta t$. Therefore,

$$E((\hat\theta - t)(t - \theta)) = 0$$

and the formula is proved.
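The decomposition just proved can be verified exactly on a small case by enumeration. The following sketch makes several assumptions for illustration only: the distribution is uniform on $\{1, \dots, \theta\}$ with a hypothetical $\theta = 4$, the sample size is 2, and the (biased) estimator is the sample maximum.

```python
from fractions import Fraction
from itertools import product

theta = 4   # hypothetical true parameter
n = 2       # hypothetical sample size

outcomes = list(product(range(1, theta + 1), repeat=n))  # equally likely samples
p = Fraction(1, len(outcomes))

def expect(g):
    """Exact expectation of g(sample) under the uniform model."""
    return sum(p * g(xs) for xs in outcomes)

t = expect(lambda xs: Fraction(max(xs)))           # t = E(theta_hat)
variance = expect(lambda xs: (max(xs) - t) ** 2)   # Var(theta_hat)
bias = t - theta                                   # E(theta_hat) - theta
mse = expect(lambda xs: Fraction(max(xs) - theta) ** 2)

print(mse, variance + bias ** 2)
```

Both printed values are the same rational number, as the formula requires.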


Problems


1. Assume that $\hat\theta_1$ is an estimator of $\theta$ with the following properties:

$$E(\hat\theta_1) = \frac{n}{n+1}\theta \quad\text{and}\quad Var(\hat\theta_1) = \frac{n}{(n+2)(n+1)^2}\theta^2.$$

Let

$$\hat\theta_2 = \frac{n+1}{n}\hat\theta_1.$$

(a) Show that $\hat\theta_2$ is an unbiased estimator of $\theta$.
(b) Is $\hat\theta_2$ a better estimator than $\hat\theta_1$?

2. Consider an i.i.d. sample of a uniform distribution on $(0, \theta)$. We have shown that
$\hat\theta_m = 2\bar X$ is an unbiased estimator of $\theta$ with variance $\frac{\theta^2}{3n}$. The purpose of this
exercise is to show that even if we limit our search to estimators of the form $a\bar X$,
then $\hat\theta_m$ is not the best choice. That is, there are better choices than $a = 2$.

(a) Show that

$$d(a\bar X, \theta) = a^2\frac{\theta^2}{12n} + \theta^2\left(\frac{a}{2} - 1\right)^2.$$

(b) Find $a_0$ so that $d(a\bar X, \theta)$ is minimum. The estimator $a_0\bar X$ is the best in the
class of estimators $a\bar X$.
(c) Show that the estimator found in (b) is biased.

3. Consider a family of exponential distributions with expectation $\theta$. That is,

$$f(x|\theta) = \frac{1}{\theta}e^{-x/\theta} \quad\text{for } x > 0.$$

(a) Show that $\bar X$ is an unbiased estimator of $\theta$.
(b) Find the best estimator of $\theta$ in the class of all estimators $a\bar X$.


4. Consider the family of distributions with density 


) for x € (—oo, +00). 


(x1) = = exp 
FOIE) = 55 xP 
(a) Compute E (|X|). 

(b) Find an unbiased estimator for 6. 


5. Consider the Pareto distributions 


f(x|@) = 


6 
deo > 0, 


where @ is in (1, 00). 


(a) Compute E(X). 
(b) X is an unbiased estimator of what parameter? 


6. (a) Let $X$ be normally distributed with mean 0 and variance $\sigma^2$. Find $E(X^2)$ and
$Var(X^2)$. (Use that $X^2/\sigma^2$ has a chi-squared distribution with 1 degree of
freedom.)
(b) Consider a family of normal distributions with mean 0 and standard deviation
$\sigma$. Our parameter is $\theta = \sigma^2$. Show that

$$\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i^2$$

is an unbiased estimator of $\theta$.
(c) Show that

$$Var(\hat\theta) = 2\frac{\sigma^4}{n}.$$

(d) Another unbiased estimator of $\theta$ is

$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$$

Which one is a better estimator, $\hat\theta$ or $S^2$? (Use that $d(S^2, \sigma^2) = \frac{2}{n-1}\sigma^4$.)

7. Recall that for a normal sample $(n-1)S^2/\sigma^2$ has a chi-squared distribution with
$n-1$ degrees of freedom. Show that $E(S^2) = \sigma^2$ and

$$Var(S^2) = \frac{2}{n-1}\sigma^4.$$


Chapter 27
Best Unbiased Estimators


1 Exponential Families of Distributions


A family of probability distributions with parameter $\theta$ is said to be an exponential
family of probability distributions if the following two conditions are met:

• The support of the probability density (i.e., the set of $x$'s such that $f(x|\theta) > 0$)
does not depend on $\theta$.
• The logarithm of the density can be written as

$$\ln(f(x|\theta)) = a(\theta)t(x) + b(\theta) + r(x).$$

The definition above is stated for a continuous family. For a discrete family, the
same definition applies if we replace $f(x|\theta)$ by $P(X = x|\theta)$.

Example 1 Consider a family of Bernoulli random variables with parameter $p$. That
is,

$$P(X = x|p) = p^x(1-p)^{1-x}$$

where $x = 0$ or $x = 1$. The support of the distribution is $\{0, 1\}$ and therefore does
not depend on $p$. For $x = 0$ or 1,

$$\begin{aligned}
\ln(P(X = x|p)) &= x\ln p + (1-x)\ln(1-p) \\
&= x\ln\left(\frac{p}{1-p}\right) + \ln(1-p).
\end{aligned}$$


© Springer Nature Switzerland AG 2022 301 
R. B. Schinazi, Probability with Statistical Applications, 
https://doi.org/10.1007/978-3-030-93635-8_27 


We can set

$$\begin{aligned}
a(p) &= \ln\left(\frac{p}{1-p}\right) \\
b(p) &= \ln(1-p) \\
t(x) &= x \\
r(x) &= 0.
\end{aligned}$$

This shows the collection of Bernoulli distributions with parameter $p$ is an
exponential family.

Note that there are other possible choices for the functions $a$ and $t$. In general,
we take $t$ to be as simple as possible.

Example 2 Consider the family of normal distributions with mean $\mu$ and standard
deviation 1. For $x$ in $(-\infty, +\infty)$,

$$f(x|\mu) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2}\right).$$

The support of the distribution is the whole real line and therefore does not depend
on $\mu$.
Taking the logarithm,

$$\ln(f(x|\mu)) = \mu x - \mu^2/2 - \ln(\sqrt{2\pi}) - x^2/2.$$

We can set

$$\begin{aligned}
a(\mu) &= \mu \\
b(\mu) &= -\mu^2/2 \\
t(x) &= x \\
r(x) &= -\ln(\sqrt{2\pi}) - x^2/2.
\end{aligned}$$

Therefore, the family of normal distributions with mean $\mu$ and standard deviation 1
is an exponential family.

Example 3 Consider the family of uniform distributions on $(0, \theta)$. The density is
defined by

$$f(x|\theta) = \frac{1}{\theta},$$

for all $x$ in $(0, \theta)$ and $f(x|\theta) = 0$ for all other $x$. In this case the support of the
distribution is $(0, \theta)$ and depends on $\theta$. Hence, the family of uniform distributions
is NOT an exponential family.
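The decompositions in Examples 1 and 2 can be spot-checked numerically by comparing the logarithm of the density with $a(\theta)t(x) + b(\theta) + r(x)$. This is a minimal sketch: the function names and the test values of $p$, $\mu$ and $x$ are hypothetical.

```python
import math

def bernoulli_log_pmf(x, p):
    """ln P(X = x | p) computed directly from p^x (1-p)^(1-x)."""
    return math.log(p ** x * (1 - p) ** (1 - x))

def bernoulli_decomposition(x, p):
    """a(p) t(x) + b(p) + r(x) with the choices of Example 1."""
    a, b, t, r = math.log(p / (1 - p)), math.log(1 - p), x, 0.0
    return a * t + b + r

def normal_log_density(x, mu):
    """ln f(x | mu) computed directly from the normal density with sd 1."""
    return math.log(math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi))

def normal_decomposition(x, mu):
    """a(mu) t(x) + b(mu) + r(x) with the choices of Example 2."""
    a, b, t = mu, -mu ** 2 / 2, x
    r = -math.log(math.sqrt(2 * math.pi)) - x ** 2 / 2
    return a * t + b + r

print(bernoulli_log_pmf(1, 0.3), bernoulli_decomposition(1, 0.3))
print(normal_log_density(1.5, -0.7), normal_decomposition(1.5, -0.7))
```

Each line should print two values that agree up to floating-point rounding. No such check is possible for the uniform family of Example 3, since its support changes with $\theta$.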


2 Minimum Variance Unbiased Estimators


There is no estimator of a parameter $\theta$ that is the best for the whole range of
possible values for $\theta$. To see why, assume that 5 is a possible value for $\theta$ and let
$\hat\theta = 5$ be an estimator of $\theta$. It is a terrible estimator; for any sample, the estimator is
always the same! However, when $\theta$ is actually 5 the estimator $\hat\theta = 5$ is perfect, the
mean squared error is 0! No other estimator can do as well for that particular value
of $\theta$. This illustrates why there is no best estimator. However, if we look for the best
estimator in the smaller class of unbiased estimators, then we may find one.

• A minimum variance unbiased estimator (m.v.u.e. in short) is an unbiased estimator
whose mean squared error is no larger than the mean squared error of any other
unbiased estimator for the whole range of values of $\theta$. In this sense, the m.v.u.e.
is the "best" among unbiased estimators.

The next result gives a recipe to find a m.v.u.e. for an exponential family of
distributions.

• Let $X_1, \dots, X_n$ be an i.i.d. sample of an exponential family of distributions. That
is, the probability density can be written as

$$\ln(f(x|\theta)) = a(\theta)t(x) + b(\theta) + r(x).$$

Let $\tau(\theta) = E(t(X_1))$. Then, the statistic

$$R = \frac{1}{n}\sum_{i=1}^n t(X_i)$$

is a m.v.u.e. for $\tau(\theta)$.

This is a particular case of the Lehmann-Scheffé Theorem. We will apply this
important theorem to exponential families several times in this chapter. For a proof
of the theorem, see Mood, Graybill, and Boes (1974).

Example 4 Consider an i.i.d. sample of exponential random variables with
parameter $\theta$. That is, for $x > 0$

$$f(x|\theta) = \theta e^{-\theta x}.$$

Can you find a m.v.u.e. for $\theta$?

We first check that this is an exponential family of distributions. Note that the
support is $(0, +\infty)$ and therefore does not depend on $\theta$.

$$\ln f(x|\theta) = \ln\theta - \theta x.$$

Let $a(\theta) = -\theta$, $b(\theta) = \ln\theta$, $t(x) = x$ and $r(x) = 0$. This shows that this is indeed
an exponential family of distributions.
The expected value of $t(X_1) = X_1$ is

$$\tau(\theta) = E(X_1) = \frac{1}{\theta}.$$

Hence, the statistic

$$R = \frac{1}{n}\sum_{i=1}^n X_i$$

is a m.v.u.e. for $\tau(\theta) = 1/\theta$. Note that this method yields a m.v.u.e. for $1/\theta$ but not
for $\theta$.

Example 5 Consider the Poisson distribution with mean $\theta$. That is,

$$P(X = x|\theta) = e^{-\theta}\frac{\theta^x}{x!},$$

for $x = 0, 1, 2, \dots$. Can you find a m.v.u.e. for $\theta$?
Note that $P(X = x|\theta) > 0$ if and only if $x$ is a positive integer or 0. Hence, the
support of the distribution does not depend on $\theta$.

$$\ln P(X = x|\theta) = -\theta + x\ln\theta - \ln x!$$

Set $a(\theta) = \ln(\theta)$, $b(\theta) = -\theta$, $t(x) = x$, and $r(x) = -\ln x!$. Therefore, the family
of Poisson distributions is exponential.
The expected value of $t(X_1) = X_1$ is $\tau(\theta) = E(X_1) = \theta$. Hence,

$$R = \frac{1}{n}\sum_{i=1}^n X_i$$

is a m.v.u.e. for $\tau(\theta) = \theta$.
We will sometimes need a more general method to find the m.v.u.e. for an
exponential family. We now state this method.

• Let $X_1, \dots, X_n$ be an i.i.d. sample of an exponential family of distributions. That
is, the probability density can be written as

$$\ln(f(x|\theta)) = a(\theta)t(x) + b(\theta) + r(x).$$

Let

$$T = \sum_{i=1}^n t(X_i)$$

and $g > 0$ be a function. Then, the statistic $R = g(T)$ is a m.v.u.e. for $E(R)$.

Note that in the preceding examples we have applied this method in the particular
case when $g(T) = T/n$. In the next example we will use a different $g$.

Example 6 Consider a family of Poisson distributions with parameter $\theta$. We want
to find a m.v.u.e. for $e^{-\theta}$.
Consider an i.i.d. sample of this Poisson distribution of size $n$. Let

$$R = \left(\frac{n-1}{n}\right)^T,$$

where $T = \sum_{i=1}^n X_i$. Since we are dealing with an exponential family of
distributions, we can apply the result we just stated for $g(t) = \left(\frac{n-1}{n}\right)^t$. Hence, $R$
is the m.v.u.e. for $E(R)$. We now compute $E(R)$.

$$E(R) = E(g(T)) = \sum_{t=0}^\infty g(t)P(T = t).$$

Recall that a sum of independent Poisson random variables is a Poisson random
variable whose rate is the sum of the rates. Therefore, $T$ is a Poisson random variable
with rate $n\theta$. Hence,

$$\begin{aligned}
E(R) &= \sum_{t=0}^\infty g(t)P(T = t) \\
&= \sum_{t=0}^\infty \left(\frac{n-1}{n}\right)^t \exp(-n\theta)\frac{(n\theta)^t}{t!} \\
&= \exp(-n\theta)\sum_{t=0}^\infty \frac{((n-1)\theta)^t}{t!} \\
&= \exp(-n\theta)\exp((n-1)\theta) \\
&= \exp(-\theta).
\end{aligned}$$

Hence, $R$ is the m.v.u.e. of $\exp(-\theta)$.
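The series computed in Example 6 can be evaluated numerically as a sanity check. This sketch truncates the infinite sum; the function name, the number of terms, and the test values $n = 5$, $\theta = 1.3$ are hypothetical choices.

```python
import math

def expected_R(n, theta, terms=200):
    """Truncated series for E[((n-1)/n)^T] where T is Poisson with rate n*theta."""
    rate = n * theta
    total = 0.0
    pmf = math.exp(-rate)            # P(T = 0)
    for t in range(terms):
        total += ((n - 1) / n) ** t * pmf
        pmf *= rate / (t + 1)        # P(T = t+1) obtained from P(T = t)
    return total

n, theta = 5, 1.3
print(expected_R(n, theta), math.exp(-theta))
```

The Poisson probabilities are built up recursively to avoid computing large factorials. With 200 terms the neglected tail is negligible for moderate values of $n\theta$, so the two printed numbers should agree to many decimal places.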


Problems


1. Consider the family of normal distributions with mean 0 and standard deviation
$\sigma$. Find a m.v.u.e. for $\sigma^2$.

2. Consider the family of normal distributions with mean $\mu$ and standard deviation
$\sigma = 1$. Find a m.v.u.e. for $\mu$.

3. Consider the family of normal distributions with mean $\mu$ and known standard
deviation $\sigma$. Is $\bar X$ a m.v.u.e. of $\mu$?

4. Consider a family of exponential distributions with expectation $\alpha$. That is, for
$x > 0$

$$f(x|\alpha) = \frac{1}{\alpha}\exp\left(-\frac{1}{\alpha}x\right).$$

Show that $\bar X$ is a m.v.u.e. of $\alpha$.

5. Consider an i.i.d. sample of random variables with density

$$f(x|\theta) = \frac{\theta}{2}\exp(-\theta|x|),$$

for all $x$.

(a) Show that this is an exponential family of distributions.
(b) Find a m.v.u.e. For what parameter?

6. Consider an i.i.d. sample of random variables with density
$f(x|\theta) = (\theta+1)x^\theta$ for $0 < x < 1$.
Find a m.v.u.e. For what parameter?

7. Consider $X_1, X_2, \dots, X_n$ an i.i.d. sample of the exponential distribution with
parameter $\theta$. That is, the density is

$$f(x|\theta) = \theta e^{-\theta x} \quad\text{for } x > 0$$

and 0 otherwise. Let $T = \sum_{i=1}^n X_i$ and

$$R = \frac{n-1}{T}.$$

(a) Show that $R$ is an unbiased estimator of $\theta$. (Use that $E(1/T) = \theta/(n-1)$.)
(b) Show that $R$ is a m.v.u.e. for $\theta$.
(c) Show that $E(1/T) = \theta/(n-1)$. (Use that a sum of $n$ i.i.d. exponential
random variables is a Gamma random variable.)
(d) Show that $Var(R) = \theta^2/(n-2)$.
(e) Explain why $1/\bar X$ is a natural choice to estimate $\theta$. Compare it to $R$.


3 Sufficient Statistics


In this section we will see how it is sometimes possible to improve an estimator. The
central idea is sufficiency.

• Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample of a distribution with parameter $\theta$.
A statistic $T$ is said to be sufficient for $\theta$ if the conditional distribution of
$X_1, X_2, \dots, X_n$ given $T = t$ does not depend on $\theta$.

The main idea behind this definition is that all the information about $\theta$ is
contained in the sufficient statistic $T$. There is no need to look at the whole sample
$(X_1, X_2, \dots, X_n)$, and it is sufficient to only look at $T$.

Example 7 Consider $X_1, X_2, \dots, X_n$ an i.i.d. sample of the Bernoulli distribution
with parameter $\theta$. Show that

$$T = \sum_{i=1}^n X_i$$

is a sufficient statistic.
Recall that the conditional probability of $A$ given $B$ is defined by

$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

Let

$$A = \{X_1 = x_1, \dots, X_n = x_n\}$$

and $B = \{T = t\}$. Using that $T$ is the sum of the $X_i$, we see that the event $A$ is
included in the event $B$ provided that $t = \sum_{i=1}^n x_i$. Hence, $A \cap B = A$ and

$$P(A|B) = \frac{P(A)}{P(B)}.$$

Since $T$ is a sum of i.i.d. Bernoulli random variables, it is a binomial random
variable. We have for $t = 0, 1, \dots, n$

$$P(T = t) = \binom{n}{t}\theta^t(1-\theta)^{n-t}.$$

Using that $X_1, \dots, X_n$ are independent,

$$\begin{aligned}
P(A) &= P(X_1 = x_1, \dots, X_n = x_n) \\
&= P(X_1 = x_1)\cdots P(X_n = x_n) \\
&= \theta^{x_1}(1-\theta)^{1-x_1}\cdots\theta^{x_n}(1-\theta)^{1-x_n} \\
&= \theta^t(1-\theta)^{n-t}.
\end{aligned}$$

Hence,

$$P(A|B) = \frac{\theta^t(1-\theta)^{n-t}}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} = \frac{1}{\binom{n}{t}}.$$

Note that all the $\theta$'s cancel. The conditional distribution of $X_1, X_2, \dots, X_n$
given $T = t$ does not depend on $\theta$. The statistic $T$ is therefore sufficient. What
is remarkable here is that the single number $T$ summarizes all the information
regarding $\theta$, which is contained in the (usually) very long vector $(X_1, X_2, \dots, X_n)$.
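The cancellation of $\theta$ in Example 7 can be observed by brute force: compute the conditional probability by enumerating all 0-1 sequences, for several values of $\theta$. This is only an illustrative sketch; the observed sequence `xs` and the three values of $\theta$ are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_prob(xs, theta):
    """P(X_1 = x_1, ..., X_n = x_n | T = sum(xs)) for Bernoulli(theta), by enumeration."""
    n, t = len(xs), sum(xs)

    def prob(seq):
        # Probability of one particular 0-1 sequence under independence.
        p = Fraction(1)
        for x in seq:
            p *= theta if x == 1 else 1 - theta
        return p

    # P(T = t): add up the probabilities of all sequences with t ones.
    p_b = sum(prob(seq) for seq in product((0, 1), repeat=n) if sum(seq) == t)
    return prob(xs) / p_b

xs = (1, 0, 1, 1, 0)   # hypothetical observations, so n = 5 and t = 3
for theta in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    print(theta, conditional_prob(xs, theta), Fraction(1, comb(5, 3)))
```

The conditional probability is the same for every $\theta$ and equals one over the number of sequences with three ones.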


4 A Factorization Theorem


The following method is useful in finding sufficient statistics.

• Factorization Theorem. Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample of a distribution
with parameter $\theta$. A statistic $T$ is sufficient if and only if the likelihood function
$L$ can be factored as follows:

$$\begin{aligned}
L(x_1,\dots,x_n|\theta) &= f(x_1|\theta)\cdots f(x_n|\theta) \\
&= g(\theta; T)h(x_1,\dots,x_n),
\end{aligned}$$

where $g$ and $h$ are positive functions.

In words, the statistic $T$ is sufficient if and only if $L$ can be factored as a product
of a function of $T$ and $\theta$ and a function that does not depend on $\theta$. For a proof of
this theorem, see Casella and Berger (2002).

We now use the factorization theorem to find a sufficient statistic for an
exponential family.

• Consider the probability density $f(x|\theta)$ such that

$$\ln f(x|\theta) = a(\theta)t(x) + b(\theta) + r(x),$$

whose support $A$ does not depend on $\theta$. Then, the statistic

$$T = \sum_{i=1}^n t(X_i)$$

is sufficient for $\theta$.

The result above holds for discrete exponential families as well. The role of
$f(x|\theta)$ is then played by $P(X = x|\theta)$.

We now prove this result. The likelihood function for an i.i.d. sample with this
distribution is for $x_1, \dots, x_n$ in $A$:

$$\begin{aligned}
L(x_1,\dots,x_n|\theta) &= \exp\left(a(\theta)\sum_{i=1}^n t(x_i) + nb(\theta) + \sum_{i=1}^n r(x_i)\right) \\
&= \exp\left(a(\theta)\sum_{i=1}^n t(x_i) + nb(\theta)\right)\exp\left(\sum_{i=1}^n r(x_i)\right).
\end{aligned}$$

Let $T = \sum_{i=1}^n t(X_i)$, and let

$$g(\theta; s) = \exp(a(\theta)s + nb(\theta)).$$

Set $h(x_1,\dots,x_n) = \exp\left(\sum_{i=1}^n r(x_i)\right)$ if all $x_i$ are in the support $A$ and 0 otherwise.
Recall that the support $A$ does not depend on $\theta$ and therefore $h$ does not depend on
$\theta$. Then,

$$L(x_1,\dots,x_n|\theta) = g(\theta; T(x_1,\dots,x_n))h(x_1,\dots,x_n).$$

By the factorization criterion, $T$ is a sufficient statistic.

Example 8 Consider the family of normal distributions with mean $\mu$ and standard
deviation 1. We have shown already that this is an exponential family with $t(x) = x$.
Hence,

$$T = \sum_{i=1}^n t(X_i) = \sum_{i=1}^n X_i$$

is sufficient for $\mu$.
Next we give an example of a sufficient statistic for a family of distributions that is not
an exponential family.

Example 9 Consider $X_1, X_2, \dots, X_n$ an i.i.d. sample of the uniform distribution in
$(0, \theta)$. Find a sufficient statistic for $\theta$.

Note that this is not an exponential family of distributions (the support $(0, \theta)$
depends on $\theta$).
The likelihood function is

$$L(x_1,\dots,x_n|\theta) = \frac{1}{\theta^n} \quad\text{if all } x_i \in (0, \theta)$$

and $L(x_1,\dots,x_n|\theta) = 0$ otherwise. Let $x_{(n)}$ be the largest of all $x_i$'s. We can rewrite
the function $L$ as

$$L(x_1,\dots,x_n|\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } \theta > x_{(n)} \\ 0 & \text{for } \theta \le x_{(n)} \end{cases}$$

Let $T(x_1,\dots,x_n) = x_{(n)}$ and

$$g(\theta; t) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } \theta > t \\ 0 & \text{for } \theta \le t \end{cases}$$

Set $h(x_1,\dots,x_n) = 1$ if all $x_i$ are positive and 0 otherwise. We have

$$L(x_1,\dots,x_n|\theta) = g(\theta; T(x_1,\dots,x_n))h(x_1,\dots,x_n).$$

The factorization theorem shows that $T = X_{(n)}$ is a sufficient statistic for $\theta$.
We now give an example of sufficient statistics for a two dimensional parameter.

Example 10 Consider the family of normal distributions with mean $\mu$ and standard
deviation $\sigma$. Both $\mu$ and $\sigma$ are unknown and therefore we have a two dimensional
parameter $\theta = (\mu, \sigma)$. The likelihood function in this case is

$$L(\theta|x_1,\dots,x_n) = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right).$$

After expanding the squares inside the exponential, we get

$$L(\theta|x_1,\dots,x_n) = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)\right).$$

Let the function $g$ be

$$g(s, t, \mu, \sigma) = \frac{1}{\sigma^n(2\pi)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}(s - 2\mu t + n\mu^2)\right).$$

It is easy to check that

$$L(\theta|x_1,\dots,x_n) = g\left(\sum_{i=1}^n x_i^2, \sum_{i=1}^n x_i, \mu, \sigma\right).$$

It turns out that the factorization theorem stated above holds for multi-dimensional
statistics. Hence, the two dimensional statistic

$$\left(\sum_{i=1}^n X_i^2, \sum_{i=1}^n X_i\right)$$

is sufficient for $(\mu, \sigma)$.


5 Conditional Expectation and Sufficiency


In Example 6 we considered a Poisson distribution with parameter $\theta$. We showed
that $R = \left(\frac{n-1}{n}\right)^T$ is a m.v.u.e. for $e^{-\theta}$. But how did we come up with $R$? The
following result explains the method.

• Consider an exponential family of probability distributions with parameter $\theta$.
Let $R$ be an unbiased estimator of $h(\theta)$. Let $T = \sum_{i=1}^n t(X_i)$ be the sufficient
statistic for $\theta$. Then the conditional expectation $E(R|T)$ is a minimum variance
unbiased estimator (m.v.u.e.) of $h(\theta)$.

This is a rather remarkable result. Take any unbiased estimator of $h(\theta)$, and by
taking the conditional expectation with respect to the sufficient statistic, we get a
m.v.u.e.!

Next we revisit Example 6.

Example 11 Consider a family of Poisson distributions with parameter $\theta$. Let
$h(\theta) = \exp(-\theta)$. We want to find a m.v.u.e. for $h(\theta)$.
The first step is to find an unbiased estimator. For an i.i.d. sample $X_1, \dots, X_n$, let
$R = 1$ if $X_1 = 0$ and $R = 0$ if $X_1 > 0$. This is not a great estimator! We will
estimate $\exp(-\theta)$ by 0 or 1 depending on what $X_1$ is and ignore the rest of the
sample! However, it is an unbiased estimator and this is all we need to get started.
Note that

$$P(R = 1) = P(X_1 = 0) = \exp(-\theta).$$

Since $R$ is a Bernoulli random variable $E(R) = P(R = 1)$. Hence, $R$ is an unbiased
estimator of $h(\theta)$.
We have seen already that

$$T = \sum_{i=1}^n X_i$$

is a sufficient statistic for a Poisson family of distributions. We now compute
$E(R|T)$. Since $R$ is a Bernoulli random variable, we have for any integer $t \ge 0$

$$E(R|T = t) = P(R = 1|T = t) = \frac{P(X_1 = 0; T = t)}{P(T = t)}.$$

Note that $\{X_1 = 0; T = t\}$ can be written as the intersection of two independent
events:

$$\{X_1 = 0; T = t\} = \{X_1 = 0\} \cap \left\{\sum_{i=2}^n X_i = t\right\}.$$

Moreover, recall that a sum of independent Poisson random variables is a Poisson
random variable whose rate is the sum of the rates. Therefore, $\sum_{i=2}^n X_i$ is a Poisson
random variable with rate $(n-1)\theta$ and $T$ is a Poisson random variable with rate $n\theta$.
Hence,

$$P\left(\sum_{i=2}^n X_i = t\right) = \exp(-(n-1)\theta)\frac{((n-1)\theta)^t}{t!}$$

and

$$P(T = t) = \exp(-n\theta)\frac{(n\theta)^t}{t!}.$$

Thus,

$$E(R|T = t) = \frac{\exp(-\theta)\exp(-(n-1)\theta)((n-1)\theta)^t/t!}{\exp(-n\theta)(n\theta)^t/t!} = \left(\frac{n-1}{n}\right)^t.$$

Therefore, a m.v.u.e. for $\exp(-\theta)$ is $\left(\frac{n-1}{n}\right)^T$.
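The conditional expectation of Example 11 can be evaluated numerically from the three Poisson probabilities involved, to see that $\theta$ drops out. The following is a minimal sketch; the function names and the test values $n = 4$ and $\theta \in \{0.5, 2.0\}$ are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """P(N = k) for a Poisson random variable N with the given rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def cond_expectation(t, n, theta):
    """E(R | T = t) = P(X_1 = 0) P(X_2 + ... + X_n = t) / P(T = t)."""
    numerator = poisson_pmf(0, theta) * poisson_pmf(t, (n - 1) * theta)
    return numerator / poisson_pmf(t, n * theta)

n = 4
for theta in (0.5, 2.0):
    values = [cond_expectation(t, n, theta) for t in range(6)]
    closed_form = [((n - 1) / n) ** t for t in range(6)]
    print(theta, values, closed_form)
```

For both values of $\theta$ the computed conditional expectations should match $\left(\frac{n-1}{n}\right)^t$ up to floating-point rounding, which illustrates why the result is a legitimate estimator: it can be computed from the sample alone.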


Example 12 Can we find a m.v.u.e. for the uniform distribution on $(0, \theta)$?
By Example 9, the statistic $T = X_{(n)}$ is sufficient. In a preceding chapter we
computed

$$E(T) = E(X_{(n)}) = \frac{n}{n+1}\,\theta.$$

Let

$$R = \frac{n+1}{n}\,T.$$

Then, R is an unbiased estimator of $\theta$ and is a function of the sufficient statistic T.
This turns out to be enough to show that R is a m.v.u.e. for $\theta$; see Mood, Graybill,
and Boes (1974).


We have chosen to concentrate on exponential families in this chapter because 
this is where most of the applications are and the theory is simpler. Next, we state a 
result that illustrates why sufficiency is a powerful tool in general. 


• Rao-Blackwell Theorem. Consider a distribution family with parameter $\theta$.
Assume that it has a sufficient statistic T. Let R be an estimator of $\theta$, and let
$R' = E(R|T)$. Then $R'$ is a better estimator than R in that

$$d(R', \theta) \le d(R, \theta),$$

for all $\theta$.


In words, one always improves an estimator by computing its conditional 
expectation with respect to a sufficient statistic. 
The proof is based on the following property of conditional expectation. 


Lemma For any random variables X and Y, we have

$$\mathrm{Var}[E(X|Y)] \le \mathrm{Var}(X).$$

That is, the variance of the conditional expectation is at most the variance of the
(unconditioned) variable. We now prove the lemma.

We will use several times the following property of conditional expectation. For
any random variables X and Y,

$$E[E(X|Y)] = E(X).$$

By definition,

$$\mathrm{Var}(X) = E(X^2) - E(X)^2.$$

Hence,

$$\mathrm{Var}(X) = E[E(X^2|Y)] - E[E(X|Y)]^2.$$

We now subtract and add $E[E(X|Y)^2]$ to get

$$\mathrm{Var}(X) = E[E(X^2|Y)] - E[E(X|Y)^2] + E[E(X|Y)^2] - E[E(X|Y)]^2.$$

Observe now that

$$E[E(X|Y)^2] - E[E(X|Y)]^2 = \mathrm{Var}[E(X|Y)],$$

so

$$\mathrm{Var}(X) = E[E(X^2|Y)] - E[E(X|Y)^2] + \mathrm{Var}[E(X|Y)].$$

Conditioned on $Y = y$,

$$E(X^2|Y = y) - E(X|Y = y)^2 = \mathrm{Var}(X|Y = y).$$

A variance is of course never negative, so for all y

$$E(X^2|Y = y) - E(X|Y = y)^2 \ge 0.$$

Therefore, $E(X^2|Y) - E(X|Y)^2$ is nonnegative as well. Hence, $\mathrm{Var}(X)$ is the sum of
two nonnegative terms, $E[E(X^2|Y)] - E[E(X|Y)^2]$ and $\mathrm{Var}[E(X|Y)]$. Hence,

$$\mathrm{Var}(X) \ge \mathrm{Var}[E(X|Y)].$$


The proof of the lemma is complete. 
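
The decomposition used in the proof can be verified exactly on a small discrete example. The Python sketch below uses a hypothetical joint distribution for (X, Y), chosen only for illustration, and exact fractions; it checks that Var(X) equals the sum of $E[E(X^2|Y) - E(X|Y)^2]$ and $\mathrm{Var}[E(X|Y)]$.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 0): F(1, 8), (1, 0): F(1, 8), (2, 0): F(1, 4),
    (0, 1): F(1, 4), (1, 1): F(1, 8), (2, 1): F(1, 8),
}

def expectation(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def conditional_mean(y0, power=1):
    """E(X^power | Y = y0)."""
    p_y = sum(p for (x, y), p in joint.items() if y == y0)
    return sum(p * x ** power for (x, y), p in joint.items() if y == y0) / p_y

var_x = expectation(lambda x, y: x * x) - expectation(lambda x, y: x) ** 2
# Var[E(X|Y)] = E[E(X|Y)^2] - E[E(X|Y)]^2
var_cond_mean = (expectation(lambda x, y: conditional_mean(y) ** 2)
                 - expectation(lambda x, y: conditional_mean(y)) ** 2)
# E[E(X^2|Y) - E(X|Y)^2], the first nonnegative term in the proof
mean_cond_var = expectation(
    lambda x, y: conditional_mean(y, 2) - conditional_mean(y) ** 2)

print(var_x, var_cond_mean, mean_cond_var)
```

Because the arithmetic is exact, the identity holds with equality rather than up to rounding.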
It is now easy to complete the proof of the Rao-Blackwell Theorem. Recall the mean
squared error formula:

$$d(R, \theta) = \mathrm{Var}(R) + (E(R) - \theta)^2.$$

Since

$$E(R') = E(E(R|T)) = E(R),$$

the estimators R and $R'$ have the same bias; in particular, if R is unbiased so is $R'$.
Hence, in order to compare $d(R, \theta)$ to $d(R', \theta)$ we only need to compare
$\mathrm{Var}(R)$ to $\mathrm{Var}(R')$. According to the lemma

$$\mathrm{Var}(R) \ge \mathrm{Var}(R')$$

and so this completes the proof of the Rao-Blackwell Theorem. However, the attentive
reader will have noticed that we seem not to have used the main hypothesis of the
theorem, that is, the sufficiency of T! In fact the hypothesis has been implicitly
used. We have defined $R' = E(R|T)$, but $R'$ could very well depend on $\theta$. If $R'$
depends on $\theta$, it is not an estimator of $\theta$! It is because T is sufficient that $R'$ does not
depend on $\theta$ and is therefore an estimator of $\theta$. The sufficiency hypothesis is crucial
in providing a true estimator of $\theta$.


Problems


8. Consider X and Y two independent random variables with density $\theta e^{-\theta x}$ for
$x > 0$. Let $T = X + Y$.


(a) Show that the conditional density of X|T = t is

$$f(x|t) = \frac{1}{t} \quad \text{for } 0 < x < t.$$

(b) How come there is no $\theta$ in $f(x|t)$?
(c) Compute P(X > 2|T =1). 


9. Consider the family of distributions with density

$$f(x|\theta) = e^{-(x-\theta)} \quad \text{if } x > \theta$$

and $f(x|\theta) = 0$ for $x \le \theta$.

(a) Is this an exponential family of distributions?
(b) Find a sufficient statistic for $\theta$.


10. Consider a family of Poisson distributions with parameter $\theta$. We want to
estimate $h(\theta) = \exp(-\theta) = P(X = 0)$. For an i.i.d. sample $X_1, \dots, X_n$, let
$R = 1$ if $X_1 = 0$ and $R = 0$ if $X_1 > 0$. In Example 11 we have computed

$$R' = E(R|T) = \Big(\frac{n-1}{n}\Big)^T,$$

where T is the sum of the $X_i$. We have shown that $R'$ is a m.v.u.e. of $h(\theta)$. Here
are the values of a sample of size 50.

0 2 4 2 5 2 3 1 0 2 2 1 0 1 0 2 2 2 1 2 0 4 0 3 1
1 4 1 3 4 1 3 1 4 0 1 0 2 0 5 1 3 2 2 3 2 2 3 3 0


(a) Compute R and R’ for this sample. 

(b) Let B be the number of 0’s in the sample. Show that B is a binomial random
variable with parameters n and $h(\theta)$.

(c) Show that $B' = B/n$ is an unbiased estimator of $h(\theta)$.

(d) Compute B’ for the sample above. 

(e) Compute the variance of B’. 

(f) Compute the variance of R’ and compare it to the variance of B’. 


11. Consider again the family of Poisson distributions with parameter $\theta$. Let
$h(\theta) = \exp(-\theta)$.


(a) Explain why $\exp(-\bar{X})$ is a reasonable choice to estimate $h(\theta)$.
(b) Show that as n goes to infinity $\big(\frac{n-1}{n}\big)^T$ approaches $\exp(-\bar{X})$, where
$T = \sum_{i=1}^n X_i$.
(c) Compute the variance of $\exp(-\bar{X})$.


12. Consider an i.i.d. sample of Gamma random variables. The density is for $x > 0$

$$g(x|\theta) = \frac{\theta^k x^{k-1} e^{-\theta x}}{(k-1)!},$$

where $(k, \theta)$ is unknown.
Show that

$$\Big(\prod_{i=1}^n X_i, \; \sum_{i=1}^n X_i\Big)$$

is a sufficient statistic for $(k, \theta)$.

13. Consider a sequence $X_1, \dots, X_n, \dots$ of i.i.d. random variables with expected
value $\mu$ and variance equal to 1. Let $R = E(\bar{X}|X_1)$.

(a) Show that

$$R = \frac{1}{n}X_1 + \frac{n-1}{n}\,\mu.$$

(b) Compute $\mathrm{Var}(R)$ and compare it to $\mathrm{Var}(\bar{X})$.
(c) Explain why R is not an estimator of $\mu$.


Chapter 28
Bayes’ Estimators


1 The Prior and Posterior Distributions 


So far $\theta$ has always been an unknown parameter. In this chapter we will take instead
the Bayesian point of view for which $\theta$ is the value of a random variable. First
a word on notation: we have consistently used uppercase letters (such as X) for a
random variable and the corresponding lowercase letter (x) for a possible value. The
uppercase letter corresponding to $\theta$ is $\Theta$. We will give ourselves a so-called prior
distribution for the random variable $\Theta$, and $\theta$ will be a possible value of $\Theta$. We use
the word “prior” because it is a distribution we pick before (i.e., prior to) having a
sample of observations. Once we have a sample, we compute the distribution of $\Theta$
conditioned on the sample. This is the so-called posterior distribution of $\Theta$.

We now describe Bayes’ estimation method. Let A and B be two events; then the
conditional probability of A given B is defined by

$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

We use this definition to compute $P(B|A)$. We get

$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$

Similarly,

$$f(\theta|x_1, \dots, x_n) = \frac{f(x_1, \dots, x_n|\theta)\, f(\theta)}{f(x_1, \dots, x_n)}.$$

This formula summarizes the method. The function $f(\theta|x_1, \dots, x_n)$ is the
density of the posterior distribution. The function $f(\theta)$ is the density of the prior
distribution. The function $f(x_1, \dots, x_n|\theta)$ is the familiar likelihood function, and
$f(x_1, \dots, x_n)$ is the probability density of the vector $(X_1, \dots, X_n)$.

To avoid more cumbersome notation, we are using the same f for different
functions. There should be no confusion because we will always write the variables
explicitly, such as $f(\theta)$, $f(x_1, \dots, x_n)$, and so on.

We start with an example. 


Example 1 Consider a family of Bernoulli distributions with parameter $\theta$. Assume
we know nothing about $\theta$ except that it is in (0, 1). We give ourselves a uniform
distribution for $\Theta$, $f(\theta) = 1$ for all $\theta$ in (0, 1). The choice of a uniform (a flat
distribution) reflects that $\theta$ is equally likely to be anywhere in (0, 1). Given $\Theta = \theta$,
the Bernoulli probability distribution is

$$f(x|\theta) = \theta^x (1-\theta)^{1-x}$$

for x = 0 or 1. Given a sample $X_1, \dots, X_n$, the posterior distribution of $\Theta$ is by
definition the distribution of $\Theta$ given $X_1, \dots, X_n$. We use the formula:

$$f(\theta|x_1, \dots, x_n) = \frac{f(x_1, \dots, x_n|\theta)\, f(\theta)}{f(x_1, \dots, x_n)}.$$

Given $\Theta = \theta$, the random variables $X_1, \dots, X_n$ are independent. Hence, the
function $f(x_1, \dots, x_n|\theta)$ is easily computed:

$$\begin{aligned}
f(x_1, \dots, x_n|\theta) &= f(x_1|\theta) \cdots f(x_n|\theta) \\
&= \theta^{x_1}(1-\theta)^{1-x_1} \cdots \theta^{x_n}(1-\theta)^{1-x_n} \\
&= \theta^t (1-\theta)^{n-t},
\end{aligned}$$

where $t = \sum_{i=1}^n x_i$.

The prior density function given in this example is $f(\theta) = 1$ for $\theta \in (0, 1)$.
Thus, for $\theta \in (0, 1)$,

$$f(\theta|x_1, \dots, x_n) = \frac{1}{f(x_1, \dots, x_n)}\,\theta^t (1-\theta)^{n-t}.$$

The function $f(\theta|x_1, \dots, x_n)$ is the density of the posterior distribution. Note that
the part in $\theta$ of the function $f(\theta|x_1, \dots, x_n)$ looks like a Beta density
with

$$a - 1 = t \quad \text{and} \quad b - 1 = n - t.$$


Since $f(\theta|x_1, \dots, x_n)$ is a density function, the part in $\theta$ determines the density
(we will say more about this). Therefore, the posterior distribution in this example
is indeed a Beta distribution with parameters $a = t + 1$ and $b = n - t + 1$; see
Fig. 28.1.

Fig. 28.1 We sample n = 10 Bernoulli random variables and observe t = 9. Hence, the
posterior density is a Beta with parameters a = t + 1 = 10 and b = n − t + 1 = 2.
Observe how we go from a flat prior distribution to a rather pointed posterior
distribution


2 Bayes’ Estimators 


Once we have computed the posterior distribution, we can compute the so-called
Bayes’ estimator, which is simply the expected value of the posterior distribution.
In Example 1, the posterior distribution is a Beta distribution and therefore the
expected value is

$$\frac{a}{a+b} = \frac{t+1}{n+2} = \frac{n\bar{x}+1}{n+2}.$$

Note that the Bayes’ estimator is not that different from the maximum likelihood
estimator $\bar{x}$, especially as $n \to +\infty$.
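
As a quick numerical check of this formula, the Python sketch below takes the case n = 10, t = 9 of Fig. 28.1. The function names and the midpoint-rule integration are illustrative choices, not part of the text; the closed form (t + 1)/(n + 2) is compared with a direct numerical integration of $\theta$ against the Beta posterior density.

```python
import math

def beta_density(theta, a, b):
    """Beta(a, b) density at theta, using the Gamma-function constant."""
    constant = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return constant * theta ** (a - 1) * (1 - theta) ** (b - 1)

def posterior_mean_numeric(a, b, steps=100000):
    """Midpoint-rule approximation of the integral of theta * density."""
    h = 1.0 / steps
    return sum((i + 0.5) * h * beta_density((i + 0.5) * h, a, b)
               for i in range(steps)) * h

n, t = 10, 9                  # the sample of Fig. 28.1
a, b = t + 1, n - t + 1       # posterior Beta parameters under a uniform prior
bayes_estimate = (t + 1) / (n + 2)
mle = t / n                   # maximum likelihood estimate, the sample mean
print(bayes_estimate, posterior_mean_numeric(a, b), mle)
```

The Bayes' estimate is pulled slightly from the sample mean toward 1/2, the mean of the uniform prior.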


Example 2 Let $X_1, \dots, X_n$ be a sample for an exponential distribution with
parameter $\theta$. Assume a prior distribution that is exponential with parameter 1. That
is, $f(\theta) = e^{-\theta}$ for $\theta > 0$. Find the posterior distribution and the Bayes’ estimator.
The function $f(x_1, \dots, x_n|\theta)$ is the likelihood function:

$$\begin{aligned}
f(x_1, \dots, x_n|\theta) &= f(x_1|\theta) \cdots f(x_n|\theta) \\
&= \theta e^{-\theta x_1} \cdots \theta e^{-\theta x_n} \\
&= \theta^n e^{-\theta t},
\end{aligned}$$

where $t = \sum_{i=1}^n x_i$. We now use the formula

$$f(\theta|x_1, \dots, x_n) = \frac{f(x_1, \dots, x_n|\theta)\, f(\theta)}{f(x_1, \dots, x_n)}.$$

We get

$$\begin{aligned}
f(\theta|x_1, \dots, x_n) &= \frac{1}{f(x_1, \dots, x_n)}\,\theta^n e^{-\theta t} e^{-\theta} \\
&= \frac{1}{f(x_1, \dots, x_n)}\,\theta^n e^{-\theta(t+1)}.
\end{aligned}$$

Observe that the part in $\theta$ is the same as the expression of a Gamma density with
$r - 1 = n$ and $\lambda = t + 1$. Hence, the posterior distribution is a Gamma distribution
with parameters $r = n + 1$ and $\lambda = t + 1$. Thus, the Bayes’ estimator (i.e., the
expected value of the posterior distribution) is

$$\frac{r}{\lambda} = \frac{n+1}{t+1} = \frac{n+1}{1 + \sum_{i=1}^n x_i}.$$

Example 3 In Example 1 we considered a Bernoulli distribution with parameter $\theta$
and a uniform prior. We showed that the posterior distribution is then a Beta with
parameters $a = 1 + t$ and $b = n + 1 - t$, where $t = \sum_{i=1}^n x_i$. Use this fact to find
the joint distribution of $(X_1, \dots, X_n)$.

From Example 1, we know that the density of the posterior distribution is for
$\theta \in (0, 1)$

$$f(\theta|x_1, \dots, x_n) = \frac{1}{f(x_1, \dots, x_n)}\,\theta^{a-1}(1-\theta)^{b-1}.$$

Recall that the beta density with parameters a and b is for $\theta$ in (0, 1)

$$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}.$$

The two expressions have the same terms in $\theta$, but are the constants (i.e., the terms
without $\theta$) equal? We now show that they are.

Since $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}$ is a probability density, its integral is 1 and
therefore

$$\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

But the integral of the posterior density is also 1. Hence,

$$\frac{1}{f(x_1, \dots, x_n)} \int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = 1.$$

Thus,

$$\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\,d\theta = f(x_1, \dots, x_n).$$

Therefore,

$$f(x_1, \dots, x_n) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

This shows as claimed before that to identify a posterior distribution it is enough
to identify the expression in $\theta$. This computation also gives the distribution of the
vector $(X_1, \dots, X_n)$. Its probability density is

$$f(x_1, \dots, x_n) = \frac{\Gamma(t+1)\Gamma(n-t+1)}{\Gamma(n+2)}.$$

Recall that for a positive integer p, $\Gamma(p) = (p-1)!$. Since $t = \sum_{i=1}^n x_i$ where each
$x_i$ can only be 0 or 1, t is a nonnegative integer. Hence,

$$f(x_1, \dots, x_n) = \frac{t!\,(n-t)!}{(n+1)!}.$$

This also shows that the random variables $X_1, X_2, \dots, X_n$ are not independent; see
the problems.
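
A simple sanity check on this joint distribution is that the probabilities of all $2^n$ vectors of zeros and ones add up to 1. The Python sketch below (function names are illustrative) performs that check with exact fractions for a few values of n.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def joint_probability(xs):
    """f(x_1, ..., x_n) = t! (n - t)! / (n + 1)! with t the number of ones."""
    n, t = len(xs), sum(xs)
    return Fraction(factorial(t) * factorial(n - t), factorial(n + 1))

def total_probability(n):
    """Sum of f over all 2^n vectors of zeros and ones."""
    return sum(joint_probability(xs) for xs in product((0, 1), repeat=n))

for n in (1, 2, 3, 6):
    print(n, total_probability(n))
```

The sum equals 1 because there are $\binom{n}{t}$ vectors with exactly t ones, each with probability $t!(n-t)!/(n+1)!$, so every value of t from 0 to n contributes $1/(n+1)$.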

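Example 4 Let $X_1, \dots, X_n$ be a sample with a normal distribution with mean $\theta$
and variance 1. Let the prior be a standard normal distribution. Compute the Bayes’
estimate.
The density of the prior distribution is

$$f(\theta) = \frac{1}{\sqrt{2\pi}} \exp(-\theta^2/2).$$

Given $\theta$, the distribution of an observation is

$$f(x|\theta) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}(x-\theta)^2\Big).$$

Therefore,

$$f(x_1, \dots, x_n|\theta) = \frac{1}{(\sqrt{2\pi})^n} \exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2\Big).$$

Hence, the posterior distribution is

$$\begin{aligned}
f(\theta|x_1, \dots, x_n) &= \frac{f(x_1, \dots, x_n|\theta)\, f(\theta)}{f(x_1, \dots, x_n)} \\
&= \frac{1}{f(x_1, \dots, x_n)}\,\frac{1}{(\sqrt{2\pi})^{n+1}} \exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2 - \frac{1}{2}\theta^2\Big).
\end{aligned}$$

As remarked earlier, if we can identify the expression in $\theta$, we will be done. By
expanding the square,

$$\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2 + \frac{1}{2}\theta^2 = \frac{1}{2}\Big(\sum_{i=1}^n x_i^2 - 2\theta\sum_{i=1}^n x_i + n\theta^2 + \theta^2\Big).$$

Hence,

$$\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2 + \frac{1}{2}\theta^2 = \frac{n+1}{2}\Big(\theta^2 - \frac{2}{n+1}\,\theta\sum_{i=1}^n x_i\Big) + \frac{1}{2}\sum_{i=1}^n x_i^2.$$

We now “complete the square” to get

$$\frac{1}{2}\sum_{i=1}^n (x_i-\theta)^2 + \frac{1}{2}\theta^2 = \frac{n+1}{2}\Big(\theta - \frac{1}{n+1}\sum_{i=1}^n x_i\Big)^2 - \frac{n+1}{2}\Big(\frac{1}{n+1}\sum_{i=1}^n x_i\Big)^2 + \frac{1}{2}\sum_{i=1}^n x_i^2.$$

We plug this side computation into the posterior distribution to get

$$f(\theta|x_1, \dots, x_n) = g(x_1, \dots, x_n) \exp\Big(-\frac{n+1}{2}\Big(\theta - \frac{1}{n+1}\sum_{i=1}^n x_i\Big)^2\Big),$$

where

$$g(x_1, \dots, x_n) = \frac{1}{f(x_1, \dots, x_n)}\,\frac{1}{(\sqrt{2\pi})^{n+1}} \exp\Big(\frac{n+1}{2}\Big(\frac{1}{n+1}\sum_{i=1}^n x_i\Big)^2 - \frac{1}{2}\sum_{i=1}^n x_i^2\Big).$$

We state the exact expression of g for the sake of completeness, but we only need
to know that it does not depend on $\theta$. We are now ready to identify the posterior
distribution as a normal distribution. Recall that a normal distribution with mean $\mu$
and variance $\sigma^2$ has density

$$\frac{1}{\sigma\sqrt{2\pi}} \exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big).$$

Therefore, the posterior distribution is a normal distribution with mean
$\frac{1}{n+1}\sum_{i=1}^n x_i$ and variance $1/(n+1)$. Hence, the Bayes’ estimate for $\theta$ is

$$\frac{1}{n+1}\sum_{i=1}^n x_i.$$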

Example 5 Here is a sample for a normal distribution with unknown mean and
variance 1.

1.5377 2.8339 -1.2588 1.8622 1.3188 -0.3077 0.5664 1.3426 4.5784 3.7694
-0.3499 4.0349 1.7254 0.9369 1.7147 0.7950 0.8759 2.4897 2.4090 2.4172 1.6715
-0.2075 1.7172 2.6302 1.4889 2.0347 1.7269 0.6966 1.2939 0.2127

According to Example 4, with a standard normal prior we get a normal posterior
with mean $\frac{1}{n+1}\sum_{i=1}^n x_i$ and variance $1/(n+1)$. Here n = 30 and $\sum_{i=1}^n x_i =
46.5569$. Hence, the Bayes’ estimate is 1.5018.

What is the probability that $\Theta$ is between 1.3 and 1.5? This can be computed
using the posterior distribution that is normally distributed with mean 1.5018 and
variance 1/31. Recall that

$$\frac{\Theta - 1.5018}{\sqrt{1/31}} = Z,$$

where Z is a standard normal variable. Hence,

$$P(1.3 < \Theta < 1.5|x_1, \dots, x_{30}) = P(-1.12 < Z < -0.01) = 0.36.$$
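
The computation in this example can be reproduced with a few lines of Python. This sketch relies only on the standard library class `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

sample = [
    1.5377, 2.8339, -1.2588, 1.8622, 1.3188, -0.3077, 0.5664, 1.3426, 4.5784, 3.7694,
    -0.3499, 4.0349, 1.7254, 0.9369, 1.7147, 0.7950, 0.8759, 2.4897, 2.4090, 2.4172,
    1.6715, -0.2075, 1.7172, 2.6302, 1.4889, 2.0347, 1.7269, 0.6966, 1.2939, 0.2127,
]
n = len(sample)
posterior_mean = sum(sample) / (n + 1)    # Bayes' estimate from Example 4
posterior_sd = (1 / (n + 1)) ** 0.5       # the posterior variance is 1/(n+1)
posterior = NormalDist(posterior_mean, posterior_sd)
probability = posterior.cdf(1.5) - posterior.cdf(1.3)
print(round(posterior_mean, 4), round(probability, 3))
```

Since the code does not round the standardized bounds to two decimals, its probability may differ slightly from the table-based value in the last reported digit.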




Problems 


1. Consider a family of Bernoulli distributions as in Example 1. Instead of taking 
a uniform prior as we did there, take a prior that is a Beta with parameters 
a=b=2. 


(a) Sketch on the same graph the density of a uniform and the density of a Beta 
with parameters 2 and 2. 

(b) Compute the posterior distribution. 

(c) Compute the Bayes’ estimator. 


2. Let $X_1, \dots, X_n$ be a sample of a Poisson distribution with mean $\theta$. Let the prior
distribution be an exponential distribution with parameter 1. That is, the prior
density is

$$f(\theta) = \exp(-\theta) \quad \text{for } \theta > 0.$$


(a) Show that the posterior distribution is a Gamma. With what parameters? 
(b) Find the Bayes’ estimator. 


3. (a) Show that for $a > 1$ and $b > 0$,

$$\int_0^\infty x^{a-1} \exp(-bx)\,dx = \frac{\Gamma(a)}{b^a}.$$

(b) Use (a) to compute the expected value of $\exp(-X)$, where X has a Gamma
distribution with parameters a and b.

(c) In problem 2 we have shown that for a Poisson distribution if the prior is
exponentially distributed, then the posterior has a Gamma distribution. Use
this result and (b) to find a Bayes’ estimator for $\exp(-\theta)$, where $\theta$ is the
mean of the Poisson distribution.

4. Use a prior Gamma distribution with parameters r and $\lambda$. Find the Bayes’
estimator for the mean of a Poisson distribution.

5. Let $X_1, \dots, X_n$ be a sample with a normal distribution with mean $\theta$ and
variance 1. Let the prior be a normal with mean a and variance 1. Compute
the Bayes’ estimator.

6. Let $X_1, \dots, X_n$ be a sample of a normal distribution with mean 0 and variance
$1/\theta$. Note that for computational convenience our parameter is the inverse of
the variance. Let the prior be an exponential random variable with parameter 1.

(a) Show that the posterior distribution has density

$$\frac{1}{f(x_1, \dots, x_n)}\,\frac{1}{(\sqrt{2\pi})^n}\,\theta^{n/2} \exp\Big(-\frac{1}{2}\theta\sum_{i=1}^n x_i^2\Big) \exp(-\theta)$$

for $\theta > 0$, where $f(x_1, \dots, x_n)$ is the density of the vector $(X_1, \dots, X_n)$.


(b) Find the Bayes’ estimator. 


7. Consider a sample from a uniform distribution on $(0, \theta)$. Assume that the prior
distribution is uniform on (0, 1).

(a) Show that the density of $(X_1, \dots, X_n, \Theta)$ is

$$f(x_1, \dots, x_n, \theta) = \frac{1}{\theta^n} \quad \text{for } \theta \in (x_{(n)}, 1),$$

where $x_{(n)}$ is the largest value of the sample.
(b) Show that the density of $(X_1, \dots, X_n)$ is

$$\int_{x_{(n)}}^1 \frac{1}{\theta^n}\,d\theta.$$

(c) Using (a) and (b), compute the posterior distribution. 
(d) Compute the Bayes’ estimator. 


8. In Example 1, estimate the probability $P(0.2 < \Theta < 0.3|x_1, \dots, x_n)$. Use
n = 10 and t = 9. (You will need to compute the integral numerically.)

9. Consider a sample of Bernoulli distribution with a uniform prior on (0, 1). We
have shown in Example 3 that the density of the vector $(X_1, \dots, X_n)$ is

$$f(x_1, \dots, x_n) = \frac{t!\,(n-t)!}{(n+1)!},$$

where $t = \sum_{i=1}^n x_i$ and each $x_i = 0$ or 1.

(a) Let n = 2. Compute $f(0, 0)$, $f(0, 1)$, $f(1, 0)$, and $f(1, 1)$.
(b) Show that $X_1$ and $X_2$ are not independent.


Chapter 29 


Multiple Linear Regression 


1 The Least Squares Estimate 




We will use the following data set from a 2010 World Health Organization report. 


Country       Mortality  Measles  DTP3  Water  Sanitation
Afghanistan         257       75    85     48          37
Albania              14       98    99     97          98
Algeria              41       88    93     83          95
Andorra               4       98    99    100         100
Angola              220       79    81     50          57
Argentina            15       99    96     97          90
Armenia              23       94    89     96          90
Australia             5       94    92    100         100
Austria               4       83    83    100         100
Azerbaijan           36       66    70     80          45
Bangladesh           54       89    95     80          53
Barbados             11       92    93    100         100
Belarus              13       99    97    100          93
Belgium               5       93    99    100         100
Belize               19       96    94     99          90
Benin               121       61    67     75          12
Bhutan               81       99    96     92          65
Bolivia              54       86    83     86          25
Bosnia               15       84    91     99          95
Botswana             31       94    96     95          60
Brazil               22       99    97     97          80
Bulgaria             11       96    95    100         100
Burkina Faso        169       75    79     76          11
Burundi             168       84    92     72          46
Cambodia             89       89    91     61          29
Cameroon            131       80    84     74          47
Canada                6       94    94    100         100
Cape Verde           29       96    98     84          54

We will first explain the method on this specific example. We will provide some 
proofs at the end of the chapter. The missing proofs can be found in Linear Models 
by S.R. Searle, 1971 (John Wiley). 

The table above gives the following information. For a given country, let Y be
the under 5 child mortality (number of children dead by age 5 per thousand births),
$X_1$ the percentage of children vaccinated against measles, $X_2$ the percentage of
children vaccinated against diphtheria, tetanus, and pertussis infections (a single
vaccine called DTP3 takes care of the three infections), $X_3$ the percentage of the
population that has access to clean water, and let $X_4$ be the percentage of the population
that has access to sanitation (sewage system, septic tanks, and so on). In this
chapter we will address the following question. Is Y a linear function of the variables
$X_1, X_2, X_3, X_4$?

More generally, the main purpose of this chapter is to see whether some variable
Y can be predicted as a linear function of explanatory variables $X_1, \dots, X_{p-1}$.

We use $X_1, X_2, \dots, X_{p-1}$ to predict Y in the following linear model:

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_{p-1} X_{p-1},$$

where $b_0, b_1, b_2, \dots, b_{p-1}$ are constants to be estimated. If p = 2, there is only one
explanatory variable and the model is called a simple regression model. For $p \ge 3$,
it is called a multiple linear regression model.

It is convenient to rewrite the model in matrix form as

$$Y = Xb,$$


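where

$$Y = \begin{pmatrix} 257 \\ 14 \\ 41 \\ 4 \\ 220 \\ 15 \\ 23 \\ 5 \\ 4 \\ \vdots \\ 29 \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & 75 & 85 & 48 & 37 \\
1 & 98 & 99 & 97 & 98 \\
1 & 88 & 93 & 83 & 95 \\
1 & 98 & 99 & 100 & 100 \\
1 & 79 & 81 & 50 & 57 \\
1 & 99 & 96 & 97 & 90 \\
1 & 94 & 89 & 96 & 90 \\
1 & 94 & 92 & 100 & 100 \\
1 & 83 & 83 & 100 & 100 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 96 & 98 & 84 & 54
\end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{pmatrix}.$$

Let $y_i$ be the i-th component of vector Y and $x_{i,j}$ be the entry in the matrix X at the
i-th row and j-th column.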


• The least squares estimate $\hat{b}$ minimizes the quantity

$$S(b) = \sum_{i=1}^n (y_i - b_0 - b_1 x_{i,1} - b_2 x_{i,2} - \dots - b_{p-1} x_{i,p-1})^2.$$

That is, for all b, $S(\hat{b}) \le S(b)$.


Before giving a formula for $\hat{b}$, we need some additional notation. We denote the
transpose of the matrix X by $X'$. That is, the first row of $X'$ is the first column of X,
the second row of $X'$ is the second column of X, and so on. In our example we have
n = 28 observations. So X is an $n \times p = 28 \times 5$ matrix. That is, X has 28 rows and
5 columns. Hence, $X'$ is a $p \times n = 5 \times 28$ matrix.


• The least squares estimate $\hat{b}$ is given by the normal equations:

$$(X'X)\hat{b} = X'Y.$$

Provided $X'X$ is invertible, there is a unique solution to the normal equations
given by

$$\hat{b} = (X'X)^{-1}X'Y.$$

In our example the product $X'X$ is given by

$$X'X = \begin{pmatrix}
28 & 2480 & 2528 & 2441 & 1972 \\
2480 & 222468 & 226070 & 218880 & 180217 \\
2528 & 226070 & 230154 & 222174 & 182379 \\
2441 & 218880 & 222174 & 219261 & 181263 \\
1972 & 180217 & 182379 & 181263 & 163252
\end{pmatrix}$$

and

$$(X'X)^{-1} = \begin{pmatrix}
7.2998 & 0.0546 & -0.1215 & -0.0261 & 0.0163 \\
0.0546 & 0.0036 & -0.0037 & -0.0005 & 0.0000 \\
-0.1215 & -0.0037 & 0.0046 & 0.0005 & -0.0002 \\
-0.0261 & -0.0005 & 0.0005 & 0.0004 & -0.0001 \\
0.0163 & 0.0000 & -0.0002 & -0.0001 & 0.0001
\end{pmatrix}.$$

This yields

$$\hat{b} = \begin{pmatrix} 410.1383 \\ -0.8081 \\ 0.6617 \\ -3.7826 \\ -0.1375 \end{pmatrix}.$$

That is, the least squares estimate of Y using the variables $X_1, X_2, X_3, X_4$ is given
by

$$\hat{Y} = 410.1383 - 0.8081 X_1 + 0.6617 X_2 - 3.7826 X_3 - 0.1375 X_4.$$
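
The normal equations can be solved numerically without any special library. The Python sketch below is one possible way to do it, under the assumption that the 28 rows of the data table are transcribed correctly; the helper `solve` and all variable names are illustrative. It builds $X'X$ and $X'Y$, solves the system by Gaussian elimination, and computes the total, residual, and regression sums of squares of the fitted values.

```python
# Rows: (mortality, measles, DTP3, water, sanitation) for the 28 countries.
data = [
    (257, 75, 85, 48, 37), (14, 98, 99, 97, 98), (41, 88, 93, 83, 95),
    (4, 98, 99, 100, 100), (220, 79, 81, 50, 57), (15, 99, 96, 97, 90),
    (23, 94, 89, 96, 90), (5, 94, 92, 100, 100), (4, 83, 83, 100, 100),
    (36, 66, 70, 80, 45), (54, 89, 95, 80, 53), (11, 92, 93, 100, 100),
    (13, 99, 97, 100, 93), (5, 93, 99, 100, 100), (19, 96, 94, 99, 90),
    (121, 61, 67, 75, 12), (81, 99, 96, 92, 65), (54, 86, 83, 86, 25),
    (15, 84, 91, 99, 95), (31, 94, 96, 95, 60), (22, 99, 97, 97, 80),
    (11, 96, 95, 100, 100), (169, 75, 79, 76, 11), (168, 84, 92, 72, 46),
    (89, 89, 91, 61, 29), (131, 80, 84, 74, 47), (6, 94, 94, 100, 100),
    (29, 96, 98, 84, 54),
]
y = [row[0] for row in data]
X = [[1.0] + [float(v) for v in row[1:]] for row in data]  # leading column of 1's
n, p = len(X), len(X[0])

# Build X'X (p x p) and X'Y (p entries).
xtx = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]

def solve(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    m = [row[:] + [rhs[k]] for k, row in enumerate(a)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * size
    for r in reversed(range(size)):
        z[r] = (m[r][size] - sum(m[r][c] * z[c] for c in range(r + 1, size))) / m[r][r]
    return z

b_hat = solve(xtx, xty)   # solution of the normal equations
fitted = [sum(bj * xj for bj, xj in zip(b_hat, row)) for row in X]
y_bar = sum(y) / n
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # regression sum of squares
print([round(v, 4) for v in b_hat])
print(round(sst), round(sse), round(ssr))
```

Solving the linear system directly, rather than forming the inverse of $X'X$ first, is a common numerical choice; both routes give the same estimate when $X'X$ is invertible.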


2 Statistical Tests 


The least squares method used in the preceding section gives a linear equation to
explain how Y varies as a function of $X_1, X_2, \dots, X_{p-1}$. In this section we will
use statistical inference to decide how good this equation is. In order to do so we
need an underlying probability model that we now formulate. We will assume that
the variable Y we wish to explain is random and that the explanatory variables
$X_1, \dots, X_{p-1}$ are deterministic.

We now state our assumptions. They will be in force for the rest of this chapter.


• We assume the following model:

$$Y = Xb + e,$$

where e is a column vector with n components $e_i$. The random variables
$e_1, \dots, e_n$ are normal i.i.d. with mean 0 and variance $\sigma^2$, where $\sigma > 0$ is a
parameter that will need to be estimated.


As a consequence of our assumptions, for every i, $y_i$ is a linear function of
the explanatory variables $X_1, X_2, \dots$ to which we add the normal random variable
$e_i$. Since $e_i$ has mean 0, $E(y_i)$ is a linear function of the explanatory variables
$X_1, X_2, \dots$. Hence, $e_i$ represents the fluctuations of $y_i$ around its mean.


2.1 Sums of Squares 


Let

$$X_0 = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$

be a column vector with n components all equal to 1. Let

$$\bar{Y} = \bar{y} X_0$$

be a column vector with all n components equal to $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. We will use
several times the following fact. If a is a column vector with components $a_1, \dots, a_n$,
then $a'$ is a row vector and

$$a'a = \sum_{i=1}^n a_i^2. \tag{1}$$

We now define the sums of squares that will determine how well the model fits. By
(1),

$$SST = \sum_{i=1}^n (y_i - \bar{y})^2 = (Y - \bar{Y})'(Y - \bar{Y}),$$

where SST is called the total sum of squares. Let

$$\hat{Y} = X\hat{b}.$$

That is, $\hat{Y}$ is the vector of y’s predicted by the least squares method. If $\hat{Y}$ is close
enough to Y (the observed y’s), then the model is probably adequate. To measure
closeness, we define another sum of squares

$$SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = (Y - \hat{Y})'(Y - \hat{Y}).$$

The sum of squares SSE is called the residual error sum of squares. This is so
because each $y_i - \hat{y}_i$ represents the “error” made by the linear approximation. The
smaller SSE is compared to SST the better the model is. The third sum of squares
is defined by

$$SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = (\hat{Y} - \bar{Y})'(\hat{Y} - \bar{Y}),$$

the sum of squares due to regression. Next we state the formula relating the different
sums of squares.


• We have the following partitioning of the total sum of squares:

$$SST = SSE + SSR.$$


2.2 The $R^2$ Statistic


A first measure of the goodness of fit of the linear model is the following statistic:


• The $R^2$ statistic is defined by

$$R^2 = \frac{SSR}{SST}.$$

The coefficient $R^2$ is always in [0, 1]. The closer it is to 1 the better the linear
model fits the data.


In our example,

$$SST = 1.3368 \times 10^5, \quad SSR = 1.1082 \times 10^5, \quad SSE = 2.2862 \times 10^4.$$

Hence,

$$R^2 = \frac{SSR}{SST} = 0.83.$$

Taking the square root of $R^2$, we get R = 0.91, which is pretty high and shows a
good fit of the linear model.


Remark It is easy to artificially increase $R^2$, making the model appear better than
it is. Every time we add an explanatory variable, $R^2$ increases (or at least does not
decrease). So if we add enough explanatory variables (even if they have nothing to do
with the model!), we can get as close to 1 as we want. This, of course, is not
recommended. The point of a mathematical model is to explain something in the
simplest possible way. Hence, one should strive to keep p as low as possible.


2.3 Significance of the Model 


We test whether

$$b_1 = b_2 = \dots = b_{p-1} = 0.$$

If we cannot reject $b_1 = b_2 = \dots = b_{p-1} = 0$, we can conclude that Y is likely not
a linear function of the explanatory variables. The model is not adequate. We now
introduce such a test:


• To test

$$H_0: b_1 = b_2 = \dots = b_{p-1} = 0$$

against

$$H_1: \text{at least one } b_i \text{ for } 1 \le i \le p-1 \text{ is not } 0$$

we use the statistic

$$F = \frac{SSR/(p-1)}{SSE/(n-p)},$$

where $p - 1$ is the number of explanatory variables $X_1, X_2, \dots, X_{p-1}$ and n is
the number of observations in the sample. Under the null hypothesis, F follows
an F distribution with degrees $(p-1, n-p)$. The P value of the test is given by

$$P = P(F(p-1, n-p) > F).$$

Under the null hypothesis, it can be shown that SSR and SSE are independent
and, once divided by $\sigma^2$, are distributed according to a chi-squared distribution with
degrees $p - 1$ and $n - p$, respectively. This is why F follows an F distribution with
degrees $p - 1$ and $n - p$.

In our example p = 5 and n = 28. Hence,

$$F = \frac{SSR/(p-1)}{SSE/(n-p)} = 27.87,$$

where F follows an F distribution with degrees (4, 23). By the F table,

$$P(F(4, 23) > 4.31) < 0.01.$$

Hence, our P value, $P(F(4, 23) > 27.87)$, is much smaller than 1%. We reject the
null hypothesis. This is strong evidence that at least one of the explanatory variables
explains Y.


2.4 Estimating the Variance 


We need to estimate the variance $\sigma^2$ in order to test an individual $b_i$.


• An unbiased estimator of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{SSE}{n-p}.$$

In our example $SSE = 2.2862 \times 10^4$, n = 28, and p = 5. Hence,

$$\hat{\sigma}^2 = \frac{SSE}{n-p} = 994.$$

Thus, an estimate for $\sigma$ is then $\sqrt{994} = 31.53$.


2.5 Testing Individual Regression Coefficients 


We are now ready to test individual $b_i$’s.

• Under the assumption of normality,

$$\frac{\hat{b}_i - b_i}{\sqrt{\hat{\sigma}^2 c_{ii}}}$$

follows a Student distribution with $n - p$ degrees of freedom, where $c_{ii}$ is the
i-th term in the diagonal of $(X'X)^{-1}$.


The property above is a consequence of the following facts. The vector $\hat{b}$ is
normal, $\hat{b}$ and SSE are independent, and SSE (divided by $\sigma^2$) has a chi-squared
distribution with $n - p$ degrees of freedom. The proofs can be found in Searle (1971).

We now apply this property to test whether $b_2 = 0$ in our example.

We perform the test

$$H_0: b_2 = 0$$
$$H_1: b_2 \ne 0.$$

Since we numbered our $b_i$’s starting at i = 0, the rows and columns of the
matrix also start at 0. Hence, we read the third term on the diagonal of $(X'X)^{-1}$ to
get $c_{22} = 0.0046$. Hence, $\sqrt{\hat{\sigma}^2 c_{22}} = 2.14$. Thus,

$$\frac{\hat{b}_2}{\sqrt{\hat{\sigma}^2 c_{22}}} = 0.31.$$

Therefore, the P value for this two-sided test is

$$P = 2P(t(23) > 0.31) > 0.2.$$

We cannot reject the null hypothesis. This does not show that the DTP3 vaccination
rate is unimportant. It only shows that the explanatory variable $X_2$ does not add
significant information to the model. It could be the case that $X_2$ is highly correlated
to the other vaccination variable $X_1$ and therefore that we do not need both variables
in the model.
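
The arithmetic behind the statistics of this section is easy to reproduce. The Python sketch below simply plugs in the rounded values reported in the text (the variable names are illustrative); the P values themselves still have to be read from F and Student tables or obtained from statistical software.

```python
import math

n, p = 28, 5                                   # observations and coefficients
sst, ssr, sse = 1.3368e5, 1.1082e5, 2.2862e4   # sums of squares reported in the text
b2_hat, c22 = 0.6617, 0.0046                   # estimate of b_2, diagonal entry of (X'X)^{-1}

r_squared = ssr / sst
f_statistic = (ssr / (p - 1)) / (sse / (n - p))
sigma2_hat = sse / (n - p)                     # unbiased estimate of the variance
t_statistic = b2_hat / math.sqrt(sigma2_hat * c22)

print(round(r_squared, 2), round(f_statistic, 2))
print(round(sigma2_hat), round(math.sqrt(sigma2_hat), 2), round(t_statistic, 2))
```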


Problems 


The problems below refer to the example treated in this section. 


1. (a) Explain why we should expect the coefficients of X1, X2, X3, X4 to be 
negative. 
(b) In your opinion what are the most important variables in explaining Y? 
Explain your reasoning. 
2. (a) Compute the best linear approximation of Y using only the variables $X_1$,
$X_3$, and $X_4$.
(b) Compute $R^2$.
(c) Test whether $b_1 = b_3 = b_4 = 0$.
(d) Compare this model to the full model.
3. (a) Compute the best linear approximation of Y using only the variable $X_2$.
(b) Compute $R^2$.
(c) Test whether $b_2 = 0$.
(d) Compare this model to the full model.


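4. In this problem we find the least squares estimates in the particular case of a single
explanatory variable and a constant. That is, the model is

$$Y = b_0 + b_1 X.$$

We have n observations denoted by $(y_i, x_i)$ for $i = 1, 2, \dots, n$. The matrix
notation for this particular case is

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$

(a) Show that

$$X'X = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix}.$$

(b) Assuming that $X'X$ is invertible (i.e., the determinant of the matrix is not
0), show that

$$(X'X)^{-1} = \frac{1}{\det(X'X)} \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix}.$$

(c) Show that

$$\hat{b} = \begin{pmatrix} \hat{b}_0 \\ \hat{b}_1 \end{pmatrix} = \frac{1}{\det(X'X)} \begin{pmatrix} \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i \\ n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i \end{pmatrix}.$$

(d) Apply the formula in (c) to the model $Y = b_0 + b_4 X_4$.
(e) When is $\det(X'X) = 0$?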


3 Proofs 


3.1 The Normal Equations 


We are looking for the best least squares approximation of Y as a linear combination
of the variables $(X_0, X_1, \dots, X_{p-1})$. We think of $Y, X_0, \dots, X_{p-1}$ as vectors in
$\mathbb{R}^n$, where n is the number of observations. Let W be the vector subspace of $\mathbb{R}^n$,
which is spanned by vectors $X_0, X_1, \dots, X_{p-1}$. Let $\hat{Y}$ be the orthogonal projection
of Y on W. The best approximation of Y by a vector in W is $\hat{Y}$; see, for instance,
Sect. 6.4 in “Elementary Linear Algebra” (11th edition) by Anton and Rorres. Recall
that X is the $n \times p$ matrix whose columns are $(X_0, X_1, \dots, X_{p-1})$, where $X_0$ is
a column of 1’s. Since $\hat{Y}$ belongs to W, it is a linear combination of the vectors
$(X_0, X_1, \dots, X_{p-1})$. Therefore, $\hat{Y}$ can be written as

$$\hat{Y} = X\hat{b},$$

for some column vector $\hat{b}$ with p components. Hence,

$$Y - \hat{Y} = Y - X\hat{b}.$$

Multiplying both sides by $X'$ yields

$$X'(Y - \hat{Y}) = X'(Y - X\hat{b}).$$

Since $\hat{Y}$ is the orthogonal projection of Y on W, $Y - \hat{Y}$ is orthogonal to W and
therefore to all the vectors $X_0, \dots, X_{p-1}$. The multiplication $X'(Y - \hat{Y})$ yields a
column vector T with p components. It is easy to see that the i-th component of T
is the inner product of the vectors $Y - \hat{Y}$ and $X_i$. Since these vectors are orthogonal,
the i-th component must be 0. This is true for every component of T and therefore
T is a zero vector. Hence,

$$X'(Y - X\hat{b}) = 0.$$

Therefore,

$$(X'X)\hat{b} = X'Y.$$

This completes the proof of the normal equations.


3.2 Partitioning the Sum of Squares 


We now prove that

$$SST = SSE + SSR.$$

We subtract and add $\hat{Y}$ to get

$$\begin{aligned}
SST &= (Y - \bar{Y})'(Y - \bar{Y}) \\
&= (Y - \hat{Y} + \hat{Y} - \bar{Y})'(Y - \hat{Y} + \hat{Y} - \bar{Y}) \\
&= (Y - \hat{Y})'(Y - \hat{Y}) + (Y - \hat{Y})'(\hat{Y} - \bar{Y}) \\
&\quad + (\hat{Y} - \bar{Y})'(Y - \hat{Y}) + (\hat{Y} - \bar{Y})'(\hat{Y} - \bar{Y}).
\end{aligned}$$

The first term in the r.h.s. is SSE and the last term is SSR. So we only need to
show that the two middle terms are 0. Note that the second term is the transpose of
the third. Hence it is enough to show that the third term is 0. We now do that. We
start with

$$(\hat{Y} - \bar{Y})'(Y - \hat{Y}) = \hat{Y}'(Y - \hat{Y}) - \bar{Y}'(Y - \hat{Y}).$$

Using the notation from Sect. 3.1, $\hat{Y}$ belongs to W (the subspace spanned by
$X_0, \dots, X_{p-1}$) and $Y - \hat{Y}$ is orthogonal to W. Hence,

$$\hat{Y}'(Y - \hat{Y}) = 0.$$

Similarly, $\bar{Y} = \bar{y}X_0$ belongs to W since $X_0$ belongs to W. Hence,

$$\bar{Y}'(Y - \hat{Y}) = 0.$$

Therefore,

$$(\hat{Y} - \bar{Y})'(Y - \hat{Y}) = 0.$$

This completes the proof that SST = SSE + SSR.


3.3. Expectation and Variance of a Random Vector 


e Let V be arandom vector with components V;, V2... V,. We define the expec- 
tation E(V) of the vector V as the vector with components E(V}),...E (Vn). 


We have the following linear properties for the expectation: 


¢ Let V and W be two random vectors with the same dimensions, then 
EV+W)=E(V)+E£E(W). 


Moreover, for non-random matrices A and B with the appropriate dimensions we 
have 


E(AV) = AE(V) and E(VB) = E(V)B. 


3 Proofs 339 


We apply these properties to show the following:

• The estimator $\hat{b}$ is an unbiased estimator of $b$.

We now prove this. Recall that the least squares estimate of $b$ is
$$\hat{b} = AY,$$
where
$$A = (X'X)^{-1}X'$$
is a non-random matrix. Since
$$Y = Xb + e$$
and $E(e)$ is the zero vector, then $E(Y) = Xb$. Hence,
$$\begin{aligned}
E(\hat{b}) &= AE(Y) \\
&= AXb \\
&= (X'X)^{-1}(X'X)b \\
&= b.
\end{aligned}$$
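
The key algebraic step is that AX is the identity matrix. A small sketch, assuming a made-up design matrix with an intercept and one predictor, checks this with exact fractions. The helper functions and the parameter value are hypothetical and only cover the 2 x 2 case needed here.

```python
from fractions import Fraction

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical design matrix: a column of ones and one predictor.
X = [[Fraction(1), Fraction(v)] for v in (1, 2, 3, 4)]
Xt = transpose(X)
A = matmul(inverse_2x2(matmul(Xt, X)), Xt)   # A = (X'X)^{-1} X'

AX = matmul(A, X)                             # should be the 2 x 2 identity
b = [[Fraction(3)], [Fraction(-2)]]           # hypothetical true parameter
E_b_hat = matmul(A, matmul(X, b))             # A E(Y) = A X b

print(AX == [[1, 0], [0, 1]], E_b_hat == b)
```

Whatever value is chosen for b, the product AXb returns b, which is the unbiasedness statement.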
We now turn to the variance of a random vector. 


• Let $V$ be a column random vector with $n$ components $V_1, \dots, V_n$. We define the
variance of $V$ as
$$\mathrm{Var}(V) = E(VV') - E(V)E(V)'.$$
It is easy to check that $\mathrm{Var}(V)$ is an $n \times n$ matrix with the following components.
At the intersection of row $i$ and column $j$, we find $\mathrm{Cov}(V_i, V_j)$.

We have the following properties for the variance of a vector:

• Let $V$ be a random vector and $c$ a non-random vector, then
$$\mathrm{Var}(V + c) = \mathrm{Var}(V).$$
Let $A$ be a non-random matrix, then
$$\mathrm{Var}(AV) = A\,\mathrm{Var}(V)A'.$$
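
The last property can be illustrated on a small discrete example. The sketch below assumes a made-up joint distribution for a two-component vector V and a made-up matrix A; it computes Var(AV) directly from the definition and compares it with A Var(V) A'. Everything is computed with exact fractions.

```python
from fractions import Fraction

# Hypothetical joint distribution of V = (V_1, V_2)': value -> probability.
dist = {(0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4), (1, 2): Fraction(1, 2)}
A = [[2, 1], [0, 3]]  # hypothetical non-random matrix

def variance(points):
    """Var(V) = E(VV') - E(V)E(V)' for a finite list of (vector, probability) pairs."""
    d = len(points[0][0])
    mean = [sum(p * v[i] for v, p in points) for i in range(d)]
    return [[sum(p * v[i] * v[j] for v, p in points) - mean[i] * mean[j]
             for j in range(d)] for i in range(d)]

def matmul(M, N):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

points_V = list(dist.items())
# Distribution of AV: apply A to every possible value of V.
points_AV = [(tuple(sum(a * vi for a, vi in zip(row, v)) for row in A), p)
             for v, p in points_V]

var_V = variance(points_V)
lhs = variance(points_AV)                                     # Var(AV)
rhs = matmul(matmul(A, var_V), [list(c) for c in zip(*A)])    # A Var(V) A'

print(lhs == rhs)
```

The off-diagonal entries of var_V are Cov(V_1, V_2), in agreement with the description of the components of the variance matrix.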




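We apply these properties to our model $Y = Xb + e$. Recall that $e$ has normal
components that are i.i.d. with mean 0 and variance $\sigma^2$. Hence,
$$\mathrm{Var}(e) = \sigma^2 I_n.$$
Since $Xb$ is non-random,
$$\begin{aligned}
\mathrm{Var}(Y) &= \mathrm{Var}(Xb + e) \\
&= \mathrm{Var}(e) \\
&= \sigma^2 I_n.
\end{aligned}$$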


We now compute the variance of $\hat{b}$.

• The variance matrix of $\hat{b}$ is
$$\mathrm{Var}(\hat{b}) = \sigma^2 (X'X)^{-1}.$$

We start by writing again
$$\hat{b} = AY,$$
where
$$A = (X'X)^{-1}X'.$$
Hence,
$$\begin{aligned}
\mathrm{Var}(\hat{b}) &= A\,\mathrm{Var}(Y)A' \\
&= (X'X)^{-1}X'\,\sigma^2 I_n\,[(X'X)^{-1}X']'.
\end{aligned}$$
Observe now that $X'X$ is a symmetric matrix (that is, it is equal to its transpose).
The inverse (if it exists) of a symmetric matrix is also symmetric, hence
$$[(X'X)^{-1}X']' = X(X'X)^{-1},$$
and
$$\mathrm{Var}(\hat{b}) = \sigma^2 (X'X)^{-1}(X'X)(X'X)^{-1} = \sigma^2 (X'X)^{-1}.$$

This completes the computation of $\mathrm{Var}(\hat{b})$.
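
A quick check of this computation, under the same illustrative assumptions as before (a made-up design matrix with an intercept and one predictor, hypothetical helper functions, and an arbitrary value for the error variance), confirms that A A' equals the inverse of X'X.

```python
from fractions import Fraction

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*N)] for row in M]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical design matrix: a column of ones and one predictor.
X = [[Fraction(1), Fraction(v)] for v in (1, 2, 3, 4)]
Xt = transpose(X)
XtX_inv = inverse_2x2(matmul(Xt, X))
A = matmul(XtX_inv, Xt)

# With Var(Y) = sigma^2 I_n, Var(b_hat) = A (sigma^2 I_n) A' = sigma^2 A A'.
sigma2 = Fraction(2)  # hypothetical error variance
var_b_hat = [[sigma2 * entry for entry in row] for row in matmul(A, transpose(A))]
expected = [[sigma2 * entry for entry in row] for row in XtX_inv]

print(var_b_hat == expected)
print(var_b_hat[1][1])  # variance of the slope estimator
```

In this example the variance of the slope estimator is 2/5, which is the error variance divided by the sum of squared deviations of the predictor from its mean (here 5).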


3.4 Normal Random Vectors 


• The random vector $V$ is said to be normal if $V = AZ + c$, where $A$ and $c$ are
a deterministic matrix and vector, respectively. The vector $Z$ is a random vector
with standard normal i.i.d. components.


Recall that our model is $Y = Xb + e$, where the components $e_i$ of $e$ are i.i.d.
normal random variables with mean 0 and variance $\sigma^2$. For $1 \le i \le n$, let
$$Z_i = \frac{e_i}{\sigma}.$$
Then, $Z_1, \dots, Z_n$ are i.i.d. standard normal random variables. Let $Z$ be the vector
with components $Z_1, \dots, Z_n$.
Hence,
$$e = AZ,$$
where $A = \sigma I_n$ ($I_n$ is the $n \times n$ identity matrix). Thus, $e$ is a normal vector.
We can rewrite $Y$ as
$$Y = AZ + Xb.$$
This shows that $Y$ is a normal vector as well.
Recall that
$$\hat{b} = (X'X)^{-1}X'Y.$$
Therefore,
$$\begin{aligned}
\hat{b} &= (X'X)^{-1}X'(AZ + Xb) \\
&= (X'X)^{-1}X'AZ + b.
\end{aligned}$$
This shows that $\hat{b}$ is a normal vector.


• Normal vectors have remarkable properties. In particular, two components of
a normal vector are independent if and only if their covariance is 0. Quadratic
forms of the type $V'QV$, where $V$ is a normal vector and $Q$ is an idempotent
matrix (i.e., $Q^2 = Q$), have chi-squared distributions. Two such quadratic forms
can be shown to be independent by matrix multiplication. The sums of squares
SST, SSR, and SSE can all be written as quadratic forms of normal vectors. The
statistical inference for the regression model is based on these properties.


Problems 


1. Let $V$ be a column vector with two components $V_1$ and $V_2$.

(a) Compute $VV'$.
(b) Show that
$$\mathrm{Var}(V) = \begin{pmatrix} \mathrm{Var}(V_1) & \mathrm{Cov}(V_1, V_2) \\ \mathrm{Cov}(V_1, V_2) & \mathrm{Var}(V_2) \end{pmatrix}.$$

2. Let $V$ be a column vector with two components $V_1$ and $V_2$. Let $A$ be a $2 \times 2$
non-random matrix. Show that
$$\mathrm{Var}(AV) = A\,\mathrm{Var}(V)A'.$$

3. Let $V$ be a column vector with two components $V_1$ and $V_2$. Let $A$ be a $2 \times 2$
non-random matrix. Show that
$$E(V'AV) = E(V')AE(V) + \mathrm{trace}(A\,\mathrm{Var}(V)),$$
where the trace of a matrix is the sum of its diagonal terms.
4. Show that $\hat{b}$ is a normal vector.
5. Recall that
$$R^2 = \frac{SSR}{SST} \quad \text{and} \quad F = \frac{SSR/(p-1)}{SSE/(n-p)}.$$
Show that
$$F = \frac{R^2/(p-1)}{(1 - R^2)/(n-p)}.$$

6. Let
$$P = X(X'X)^{-1}X'.$$

(a) What is the size of matrix $P$?

(b) Show that $P$ is symmetric (i.e., $P' = P$). (You may use the fact that the
inverse of a symmetric matrix is symmetric.)

(c) Show that $P$ is idempotent (i.e., $P^2 = P$).

(d) Show that if $A$ is idempotent, so is $I - A$, where $I$ is the identity matrix.




7. In this problem we prove that $SSE/(n - p)$ is an unbiased estimator of $\sigma^2$.

(a) Show that
$$\hat{Y} = PY,$$
where
$$P = X(X'X)^{-1}X'.$$
(b) Show that
$$SSE = Y'(I_n - P)Y,$$
where $I_n$ is the $n \times n$ identity matrix.
(c) Show that
$$E(Y'(I_n - P)Y) = E(Y')(I_n - P)E(Y) + \sigma^2\,\mathrm{trace}(I_n - P).$$
(Use Problem 3.)
(d) Show that
$$(I_n - P)E(Y) = 0$$
and hence
$$E(Y'(I_n - P)Y) = \sigma^2\,\mathrm{trace}(I_n - P).$$
(e) Show that
$$\mathrm{trace}(P) = \mathrm{trace}(I_p) = p.$$
(You may use that trace(AB) = trace(BA) for any same size square
matrices $A$ and $B$.)
(f) Use (e) to show that
$$E(Y'(I_n - P)Y) = \sigma^2 (n - p).$$
(g) Show that $SSE/(n - p)$ is an unbiased estimator of $\sigma^2$.


List of Common Discrete Distributions 


Bernoulli with parameter $p$:
$P(X = 0) = 1 - p$ and $P(X = 1) = p$
$E(X) = p$
$\mathrm{Var}(X) = p(1 - p)$
$M_X(t) = 1 - p + pe^t$.
Binomial with parameters $n$ and $p$:
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$ for $k = 0, 1, \dots, n$.
$E(X) = np$
$\mathrm{Var}(X) = np(1 - p)$
$M_X(t) = (1 - p + pe^t)^n$.
Geometric with parameter $p$:
$P(X = k) = (1 - p)^{k-1} p$ for $k = 1, 2, \dots$.


$E(X) = \frac{1}{p}$
$\mathrm{Var}(X) = \frac{1 - p}{p^2}$
$M_X(t) = \frac{pe^t}{1 - (1 - p)e^t}$ for $t < -\ln(1 - p)$.


Poisson with parameter $\lambda$:
$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$ for $k = 0, 1, \dots$.
$E(X) = \lambda$
$\mathrm{Var}(X) = \lambda$
$M_X(t) = \exp(\lambda(e^t - 1))$.
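
The formulas in this list can be checked directly from a probability mass function. As a minimal sketch, the following computes the total mass, mean, and variance of a binomial distribution with exact fractions and compares them with np and np(1 - p). The parameter values n = 5 and p = 1/3 are arbitrary choices for illustration.

```python
from fractions import Fraction
from math import comb

n, p = 5, Fraction(1, 3)  # hypothetical parameter values

# Binomial probability mass function P(X = k) for k = 0, 1, ..., n.
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

total = sum(pmf.values())
mean = sum(k * q for k, q in pmf.items())
var = sum(k * k * q for k, q in pmf.items()) - mean**2

print(total == 1, mean == n * p, var == n * p * (1 - p))
```

Setting n = 1 reduces the same check to the Bernoulli case.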


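List of Common Continuous Distributions


Beta with parameters $(a, b)$:
$f(x) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1 - x)^{b-1}$ for $0 < x < 1$
$E(X) = \frac{a}{a + b}$
$\mathrm{Var}(X) = \frac{ab}{(a + b)^2 (a + b + 1)}$
Exponential with parameter $a$:
$f(x) = ae^{-ax}$ for $x > 0$
$E(X) = \frac{1}{a}$
$\mathrm{Var}(X) = \frac{1}{a^2}$
$M_X(t) = \frac{a}{a - t}$ for $t < a$.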
Gamma with parameters $(r, \lambda)$:
$f(x) = \frac{\lambda^r}{\Gamma(r)} x^{r-1} e^{-\lambda x}$ for $x > 0$
$E(X) = \frac{r}{\lambda}$
$\mathrm{Var}(X) = \frac{r}{\lambda^2}$
$M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r$ for $t < \lambda$.


Chi-squared with $k$ degrees of freedom:
$f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$ for $x > 0$
$E(X) = k$
$\mathrm{Var}(X) = 2k$
$M_X(t) = \frac{1}{(1 - 2t)^{k/2}}$ for $t < 1/2$.
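Normal with mean $\mu$ and standard deviation $\sigma$:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$ for $-\infty < x < +\infty$
$E(X) = \mu$
$\mathrm{Var}(X) = \sigma^2$
$M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$.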


Standard normal:
$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$ for $-\infty < x < +\infty$
$E(X) = 0$
$\mathrm{Var}(X) = 1$
$M_X(t) = e^{\frac{1}{2}t^2}$.
Uniform on $[a, b]$:
$f(x) = \frac{1}{b - a}$ for $a < x < b$
$E(X) = \frac{a + b}{2}$
$\mathrm{Var}(X) = \frac{(b - a)^2}{12}$.


Further Reading 


Probability 


The following two references are at about the same level as this book. They cover 
additional topics and examples in probability. 

The essentials of probability by R. Durrett (Duxbury Press). 

Probability by J. Pitman (Springer-Verlag). 

An introduction to probability theory and its applications by W. Feller (Volume
I, third edition, Wiley) is a great book. It covers hundreds of interesting topics
and examples. It is, however, rather terse and requires a substantial effort from the
reader.

Theoretical probabilities for applications by S. Port, Wiley. This is a very good 
advanced probability text. 


Statistics 


A good elementary introduction to statistics is Introduction to the practice of
statistics by D. Moore and G. McCabe (second edition, Freeman). A more mathematical
approach to statistics is contained in Probability and Statistics by K. Hastings
(Addison-Wesley).

At an intermediate level, the reader may read Introduction to the Theory of 
Statistics by A.M. Mood, F.A. Graybill, and D.C. Boes (1974) (third edition, 
McGraw Hill), Mathematical statistics and data analysis by J.A. Rice (third edition, 
Thomson) or Statistical inference by G. Casella and R.L. Berger (2002) (second 
edition, Duxbury). 

For simple and multiple linear regression, A second course in Statistics (Fifth
edition) by W. Mendenhall and T. Sincich is a good text focusing on the applied
side of things. For the theory, Linear Models by S.R. Searle, 1971 (John Wiley) is a
good text. The reader will find there the proofs we omitted in Chap. 29.

To find the mathematical proofs that were omitted in Chap. 27 as well as many 
other results, the reader may consult The theory of statistical inference by S. Zacks 
(Wiley). This is an advanced text. 


Standard Normal Table 


The table below gives $P(0 < Z < z)$.
For instance, $P(0 < Z < 0.65) = 0.2422$.


z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09
0.0 | 0.0000 | 0.0040 | 0.0080 | 0.0120 | 0.0160 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359
0.1 | 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753
0.2 | 0.0793 | 0.0832 | 0.0871 | 0.0910 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141
0.3 | 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.1480 | 0.1517
0.4 | 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.1700 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879
0.5 | 0.1915 | 0.1950 | 0.1985 | 0.2019 | 0.2054 | 0.2088 | 0.2123 | 0.2157 | 0.2190 | 0.2224
0.6 | 0.2257 | 0.2291 | 0.2324 | 0.2357 | 0.2389 | 0.2422 | 0.2454 | 0.2486 | 0.2517 | 0.2549
0.7 | 0.2580 | 0.2611 | 0.2642 | 0.2673 | 0.2704 | 0.2734 | 0.2764 | 0.2794 | 0.2823 | 0.2852
0.8 | 0.2881 | 0.2910 | 0.2939 | 0.2967 | 0.2995 | 0.3023 | 0.3051 | 0.3078 | 0.3106 | 0.3133
0.9 | 0.3159 | 0.3186 | 0.3212 | 0.3238 | 0.3264 | 0.3289 | 0.3315 | 0.3340 | 0.3365 | 0.3389
1.0 | 0.3413 | 0.3438 | 0.3461 | 0.3485 | 0.3508 | 0.3531 | 0.3554 | 0.3577 | 0.3599 | 0.3621
1.1 | 0.3643 | 0.3665 | 0.3686 | 0.3708 | 0.3729 | 0.3749 | 0.3770 | 0.3790 | 0.3810 | 0.3830
1.2 | 0.3849 | 0.3869 | 0.3888 | 0.3907 | 0.3925 | 0.3944 | 0.3962 | 0.3980 | 0.3997 | 0.4015
1.3 | 0.4032 | 0.4049 | 0.4066 | 0.4082 | 0.4099 | 0.4115 | 0.4131 | 0.4147 | 0.4162 | 0.4177
1.4 | 0.4192 | 0.4207 | 0.4222 | 0.4236 | 0.4251 | 0.4265 | 0.4279 | 0.4292 | 0.4306 | 0.4319
1.5 | 0.4332 | 0.4345 | 0.4357 | 0.4370 | 0.4382 | 0.4394 | 0.4406 | 0.4418 | 0.4429 | 0.4441
1.6 | 0.4452 | 0.4463 | 0.4474 | 0.4484 | 0.4495 | 0.4505 | 0.4515 | 0.4525 | 0.4535 | 0.4545
1.7 | 0.4554 | 0.4564 | 0.4573 | 0.4582 | 0.4591 | 0.4599 | 0.4608 | 0.4616 | 0.4625 | 0.4633 
1.8 | 0.4641 | 0.4649 | 0.4656 | 0.4664 | 0.4671 | 0.4678 | 0.4686 | 0.4693 | 0.4699 | 0.4706 
1.9 | 0.4713 | 0.4719 | 0.4726 | 0.4732 | 0.4738 | 0.4744 | 0.4750 | 0.4756 | 0.4761 | 0.4767 
2.0 | 0.4772 | 0.4778 | 0.4783 | 0.4788 | 0.4793 | 0.4798 | 0.4803 | 0.4808 | 0.4812 | 0.4817 
2.1 | 0.4821 | 0.4826 | 0.4830 | 0.4834 | 0.4838 | 0.4842 | 0.4846 | 0.4850 | 0.4854 | 0.4857 
2.2 | 0.4861 | 0.4864 | 0.4868 | 0.4871 | 0.4875 | 0.4878 | 0.4881 | 0.4884 | 0.4887 | 0.4890 
2.3 | 0.4893 | 0.4896 | 0.4898 | 0.4901 | 0.4904 | 0.4906 | 0.4909 | 0.4911 | 0.4913 | 0.4916 
2.4 | 0.4918 | 0.4920 | 0.4922 | 0.4925 | 0.4927 | 0.4929 | 0.4931 | 0.4932 | 0.4934 | 0.4936 
2.5 | 0.4938 | 0.4940 | 0.4941 | 0.4943 | 0.4945 | 0.4946 | 0.4948 | 0.4949 | 0.4951 | 0.4952 
2.6 | 0.4953 | 0.4955 | 0.4956 | 0.4957 | 0.4959 | 0.4960 | 0.4961 | 0.4962 | 0.4963 | 0.4964 
2.7 | 0.4965 | 0.4966 | 0.4967 | 0.4968 | 0.4969 | 0.4970 | 0.4971 | 0.4972 | 0.4973 | 0.4974 
2.8 | 0.4974 | 0.4975 | 0.4976 | 0.4977 | 0.4977 | 0.4978 | 0.4979 | 0.4979 | 0.4980 | 0.4981 
2.9 | 0.4981 | 0.4982 | 0.4982 | 0.4983 | 0.4984 | 0.4984 | 0.4985 | 0.4985 | 0.4986 | 0.4986 
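
Entries of this table can be reproduced with the error function from the Python standard library, using the identity P(0 < Z < z) = erf(z / sqrt(2)) / 2 for z >= 0. The sketch below is only an illustration; the function name is an assumption.

```python
from math import erf, sqrt

def prob_0_to_z(z):
    """P(0 < Z < z) for a standard normal Z and z >= 0, via the error function."""
    return 0.5 * erf(z / sqrt(2))

# Compare with the table entries for z = 0.65 and z = 2.00.
print(round(prob_0_to_z(0.65), 4))
print(round(prob_0_to_z(2.00), 4))
```

For a negative z, symmetry of the standard normal density gives P(z < Z < 0) = P(0 < Z < -z), so the same function covers both sides.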


Student Table 


The table below gives $t_a$ such that $P(|t(n)| < t_a) = a$, where $t(n)$ is a Student
distribution with $n$ degrees of freedom. For instance, we read

$P(|t(5)| < 1.48) = 0.8$.


a
n 0.6 0.7 0.8 0.9 0.95
1 1.38 1.96 3.08 6.31 12.71
2 1.06 1.39 1.89 2.92 4.30
3 0.98 1.25 1.64 2.35 3.18
4 0.94 1.19 1.53 2.13 2.78
5 0.92 1.16 1.48 2.02 2.57
6 0.91 1.13 1.44 1.94 2.45
7 0.90 1.12 1.41 1.89 2.36
8 0.89 1.11 1.40 1.86 2.31 
9 0.88 1.10 1.38 1.83 2.26 
10 0.88 1.09 1.37 1.81 2.23 
11 0.88 1.09 1.36 1.80 2.20 
12 0.87 1.08 1.36 1.78 2.18 
13 0.87 1.08 1.35 1.77 2.16 
14 0.87 1.08 1.35 1.76 2.14 
15 0.87 1.07 1.34 1.75 2.13 
16 0.86 1.07 1.34 1.75 2.12 
17 0.86 1.07 1.33 1.74 2.11 
18 0.86 1.07 1.33 1.73 2.10 
19 0.86 1.07 1.33 1.73 2.09 
20 0.86 1.06 1.33 1.72 2.09 
21 0.86 1.06 1.32 1.72 2.08 
22 0.86 1.06 1.32 1.72 2.07 
23 0.86 1.06 1.32 1.71 2.07 
24 0.86 1.06 1.32 1.71 2.06 
25 0.86 1.06 1.32 1.71 2.06 
∞ 0.84 1.04 1.28 1.64 1.96
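
Entries of the Student table can be checked by integrating the Student density numerically. The sketch below uses Simpson's rule from scratch with the standard library only; the function names and the number of steps are assumptions, and the result is an approximation rather than an exact value.

```python
from math import gamma, pi, sqrt

def student_density(x, n):
    """Density of the Student distribution with n degrees of freedom."""
    c = gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def prob_abs_less(c, n, steps=1000):
    """Approximate P(|t(n)| < c) by Simpson's rule on [0, c], doubled by symmetry.

    steps must be even.
    """
    h = c / steps
    total = student_density(0, n) + student_density(c, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_density(i * h, n)
    return 2 * total * h / 3

# Table entry for n = 5 and a = 0.8 is 1.48.
print(round(prob_abs_less(1.48, 5), 2))
```

Because the table values are rounded to two decimals, the computed probabilities agree with the column headings only approximately.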


Chi-Squared Table 


The table below gives $\chi_a$ such that $P(\chi^2(n) > \chi_a) = a$, where $\chi^2(n)$ is a chi-squared
distribution with $n$ degrees of freedom. For instance, we read

$P(\chi^2(2) > 9.21) = 0.01$.


a 
n 0.10 0.05 0.01 
1 2.71 3.84 6.63 
2 4.61 5.99 9.21 
3 6.25 7.81 11.34 
4 7.78 9.49 13.28 
5 9.24 11.07 15.09 
6 10.64 12.59 16.81 
7 12.02 14.07 18.48 
8 13.36 15.51 20.09 
9 14.68 16.92 21.67 
10 15.99 18.31 23.21 
11 17.28 19.68 24.72 
12 18.55 21.03 26.22 
13 19.81 22.36 27.69 
14 21.06 23.68 29.14 
15 22.31 25.00 30.58 
16 23.54 26.30 32.00 
17 24.77 27.59 33.41 
18 25.99 28.87 34.81 
19 27.20 30.14 36.19 
20 28.41 31.41 37.57 
21 29.62 32.67 38.93 
22 30.81 33.92 40.29 
23 32.01 35.17 41.64 
24 33.20 36.42 42.98 
25 34.38 37.65 44.31 
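
For an even number of degrees of freedom, the upper tail of a chi-squared distribution has a closed form: a chi-squared distribution with n = 2m degrees of freedom is a Gamma distribution with parameters (m, 1/2), so P(chi-squared(2m) > x) equals exp(-x/2) times the sum of (x/2)^j / j! for j = 0, ..., m - 1. The sketch below uses this to check a few table entries; the function name is an assumption, and odd degrees of freedom are deliberately not covered.

```python
from math import exp, factorial

def chi2_upper_tail_even(x, n):
    """P(chi-squared(n) > x) for even n, using the closed form for even degrees of freedom."""
    if n % 2 != 0:
        raise ValueError("this closed form only covers even n")
    half = x / 2
    return exp(-half) * sum(half**j / factorial(j) for j in range(n // 2))

# Table entries: n = 2 with a = 0.01, and n = 4 with a = 0.05.
print(round(chi2_upper_tail_even(9.21, 2), 2))
print(round(chi2_upper_tail_even(9.49, 4), 2))
```

With n = 2 the formula reduces to exp(-x/2), which is the tail of an exponential distribution with parameter 1/2.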


Index 


B 
Bayes estimator, 317-325 
Bayes’ Formula, 317, 318, 320 
Bernoulli random variable 
expectation, 30, 47, 219, 223, 311, 312 
variance, 33, 46-48, 114, 222-226 
Beta distribution, 182-184, 261, 319 
Binomial coefficient 
Pascal’s triangle, 45, 78 
Binomial random variable 
computational formula, 67-69, 71 
expectation, variance, 47, 54 
Binomial Theorem, 78, 79, 240 
Birthday problem 
Bivariate normal, 209-217 


C
Central Limit Theorem (CLT) 
proof, 254-257 
Chi-squared distribution, 148, 150, 162, 163, 
262-263, 274-277, 293, 299, 300, 
333, 334, 341 
Chi-squared tests, 147-154 
Conditional distribution 
continuous case, 201-203
discrete case, 197-201, 208 
Conditional probability, 9-24, 86, 87, 93, 198, 
199, 203, 215-217, 307, 317 
Confidence interval 
difference of means, 138 
difference of proportions, 130 
mean, 125, 134-138, 140, 145, 146 
proportion, 123-126, 128-130 




Convergence in distribution, 252-257 

Correlation, v, 190, 192, 194-195, 209-211, 
229 

Coupling, 60, 231-240 

Covariance, v, 187-195, 210, 223, 224, 226, 
238, 239, 341 

Cumulative distribution function, v, 159-168, 
266 


E 
Expectation 
continuous random variable, 94, 98, 
105-108, 187, 201, 203 
discrete random variable, 29-33, 35, 37, 
197, 201, 205, 207, 208 
sample average, 110, 115, 273 
Exponential families of distributions, 301-306, 
309, 311, 315 
Exponential random variable 
memoryless property, 93-94 


F 
Factorial, 68, 70, 74, 179 
F distribution, 259-267, 333, 334 


G 

Gamma random variable, 179-182, 184, 185, 
259-262, 265, 266, 306, 316 

Geometric random variable, 26-31, 40, 54, 
245, 251, 256 

Goodness of fit test, 149-154 




H 
Hemophilia, 15 
Hypothesis test, 119, 123, 139 


I 
Independence 
events, 15-18, 36, 147
random variables, 26, 35-40, 48, 55, 171, 
187, 189-195, 243, 245, 261 


J 

Joint distributions 
continuous, 169-186 
discrete, 36 


L 

Law of Large Numbers, 30, 110, 116 
Least squares, 327-331, 336, 339 
Lognormal distribution, 164 


M 

Marginal densities, 169-171, 173, 176, 178, 
181, 201, 203, 217 

Matching problem, 221-224 

Maximum likelihood estimation, v, 284-290, 
295, 319 

Maximum of a sample, 116, 165-166, 
284-287, 289, 290 

Mean, see Expectation 

Mean squared error, 291-296, 303 

Median, 97-98, 108, 144, 145, 289, 290 
Minimum of a sample, 165-166, 289, 290, 
303-306 

Moment, v, 89, 206, 209, 224-226, 248-252, 
279-283, 295, 296 

Moment generating function, v, 241-257, 259, 
260, 274, 275 

Monte-Carlo integration, 116-117 


N 

Negative binomial, 53-55 

Normal approximation to the binomial, 48-54, 59, 62

Normal random variable, 49, 50, 53, 102-108, 
152, 163, 164, 174, 175, 209, 211, 
214, 216, 217, 246-248, 251, 256, 
262, 264, 270, 274-276, 331, 341 

Normal random vectors, 341-343 




P 
Pascal’s triangle, 45, 78, 79 
Poisson random variable 
approximation to a sum of binomials, 
60-63 
approximation to the binomial, 59-60 
mean, 57-59, 61-63, 69, 70, 149, 232-235, 
237, 239, 240, 244, 248, 252, 253, 
257 
scatter theorem, 58-59 
variance, 63-64, 305 
Posterior distribution, 317-325 
Prior distribution, 317-321, 324, 325 
Probability density function, 94, 180 
P-value, 120-123, 127, 128, 131-133, 135, 
136, 141-144, 148-152, 156, 
333-335 


R 
Random variables 
Bernoulli, 25-26, 30, 33, 46-48, 65-67, 71,
86, 114-229, 231-232, 234, 235, 
237, 238, 241, 243, 282, 288, 301, 
307, 311, 312, 319 
Beta, 178-185 
binomial, 43-55, 59, 60, 67-70, 144, 200, 
219, 228, 235, 239, 243, 245, 251, 
252, 284, 307, 315 
Cauchy density, 174, 178 
Chi-squared, 185, 262—267, 276 
exponential, 92-93, 95, 97, 108, 113, 154, 
160, 165, 167, 174, 176, 177, 180, 
185, 243, 249, 260-261, 303, 324 
F, 159, 160, 164, 165, 167, 296 
Gamma, 179-182, 184, 185, 259-262, 265, 
266, 306, 316 
geometric, 26-31, 40, 54, 71, 245, 251 
lognormal, 164 
negative binomial, 54, 55 
normal, 49, 50, 53, 102-107, 152, 162-164, 
174, 175, 209, 211, 214, 216, 217, 
246-248, 251, 256, 262, 264, 270, 
274-276, 331, 341 
Poisson, 57-64, 69-70, 149, 228, 232-235, 
237, 239, 240, 244-245, 248, 252, 
253, 257, 305, 312 
Student, 139, 264, 277 
uniform, 65, 67, 90-91, 94, 98, 112, 113, 
116, 117, 160, 161, 164-167, 177, 
185, 231, 245, 251 
Weibull, 164 
Rao-Blackwell Theorem, 313, 314 




S
Sample average
expectation, 273
variance, 109-117, 145, 269-273, 276
Simulation
continuous random variables, 167
discrete random variables, 65-72
Standard deviation, 33, 34, 38, 40, 52, 99, 103,
104, 106-108, 112-116, 131-141,
143, 145, 146, 151, 152, 163, 168,
215, 246, 251, 272, 299, 302, 306,
348
Standard normal, 48-50, 53, 62, 102-107,
111-114, 116, 120, 123, 127, 129,
132, 134, 136, 139, 151, 152,
154, 162-164, 168, 173-175, 192,
209, 212, 214-217, 246-248, 251,
[...], 353-354
Stirling’s formula, 74, 77
Student random variable, 139, 264, 277
Sufficiency, 307, 311-314
Sum of Bernoulli random variables, 221, 226
Sum of exponential random variables, 260-261
Sum of Poisson random variables, 244


T
Tests
goodness of fit, 149-152
independence, 147-149
matched pairs, 142-143
mean, 131-133, 145
proportion, 119-122, 129
sign, 143-146
two means, 135-136, 141-142
two proportions, 126-128
Transformation of a random variable, 209
Transformation of a random vector, 172-176


U
Unbiased estimators, 114-116, 292-293, 295,
298-316, 334, 339, 343


V
Variance, 33, 46, 63, 83, 98, 109, 123, 131,
141, 168, 180, 191, 209, 222, 246,
263, 269, 280, 291, 303, 321, 331


